ABSTRACT


Across  defense,  homeland  security,  and  law  enforcement  communities,  leaders  face  the 
tension  between  making  quick  but  also  well  informed  decisions  regarding  time-dependent 
entities  of  interest.  For  example,  consider  a  law  enforcement  organization  (searcher)  with  a 
sizable  list  of  potential  terrorists  (targets)  but  far  fewer  observational  assets  (sensors).  The 
searcher’s goal is to follow the target, but resource constraints make continuous coverage
impossible,  resulting  in  intermittent  observational  attempts.  We  model  target  behaviour  as 
a  discrete  time  Markov  chain  with  the  state  space  being  the  target’s  set  of  possible  locations, 
activities,  or  attributes.  In  this  setting,  we  define  “following  the  target”  as  the  searcher,  at 
any  given  time  step,  correctly  identifying  and  then  allocating  the  sensor  to  the  state  which 
has  the  highest  probability  of  containing  the  target.  In  other  words,  in  each  time  period  the 
searcher’s  objective  is  to  decide  where  to  send  the  sensor,  attempting  to  observe  the  target 
in  that  time  period,  resulting  in  a  hit  or  miss  from  which  the  searcher  learns  the  target’s  true 
transition behaviour. We develop a Multi-Armed Bandit approach for efficiently following
the  target,  where  each  state  takes  the  place  of  an  arm.  Our  search  policy  is  five  to  ten  times 
better  than  existing  approaches. 

Executive Summary


Across  the  defense  and  homeland  security  communities,  decision  makers  are  faced  with  the 
tension  between  making  quick  but  also  well-informed  decisions  regarding  issues  of 
interest  that  change  over  time. 

Consider, for example, a law enforcement organization with a sizable list of potential terrorists
but a limited number of observational assets. We designate these potential terrorists as
targets and the observational assets as sensors. The sensors might be patrol officers, cameras,
or even small drones. Because of the disparity between the number of targets and available
sensors, continuous coverage is impossible, resulting in intermittent observational attempts
(hourly, daily, or even weekly). The goal of the law enforcement organization, the searcher,
is  to  learn  the  baseline  behaviour  pattern  for  each  target  as  quickly  as  possible.  Once  a 
reasonable  baseline  is  established,  the  searcher  would  shift  to  some  form  of  change-point 
detection,  in  order  to  detect  if  the  target  is  planning  an  attack  or  just  going  on  vacation. 

In  this  thesis  we  examine  how  to  quickly  establish  a  behaviour  pattern  baseline,  providing 
a method that consistently outperforms the Naive version in expectation. We model the
target’s behavior as a discrete time Markov chain whose state space is the target’s
location, activity, or any specific attributes that change with time. In the simplest scenario,
the  searcher  has  one  sensor  and  in  each  time  period  decides  where  to  send  the  sensor, 
attempting  to  find  the  target,  and  resulting  in  a  hit  or  miss  from  which  the  searcher  learns 
the  target’s  transition  behaviour.  In  general,  the  searcher’s  decision  variables  are  the  sensor’s 
locations  (i.e.,  states  of  the  Markov  chain)  over  time.  The  searcher’s  objective  is  to  allocate 
the  sensor  dynamically  so  as  to  learn  the  target’s  behaviour  pattern  as  quickly  as  possible. 
Figure  1  depicts  a  simple  four  state  example  transition  kernel  that  defines  the  behaviour 
pattern for a target. We focus on the House to Cafe transition ($p_{1,4}$).

We  develop  a  Multi-Armed  Bandit  approach  for  efficiently  following  this  target,  where  each 
state  takes  the  place  of  an  arm.  Our  search  policy  is  five  to  ten  times  better  than  existing 
approaches, as can be seen in Figure 2. This figure corresponds to our method’s performance
on the target defined in Figure 1. The black line is the true probability and the blue line
is our method’s estimate at each time step $n$.




Figure  1.  Example  Four  State  Transition  Probability  Graph.  The  House  to 
Cafe  Transition  Being  Bolded  and  Corresponding  to  the  State  1  to  State  4 
Transition. 


[Figure 2 plot title: Averaged Estimated $p_{1,4}$ vs. Time (N = 1,000 steps with 100 replications on a four-state system)]


Figure  2.  Estimated  Four  State  Transition  Probabilities  (Focused  on  the 
House to Cafe or State 1 to 4 Transition) vs. Time (Mean with 95% Confidence
Intervals). Comparison Between Current Naive Methods and our
Single-Miss  MLE  Approach.  Generated  in  MATLAB. 

CHAPTER 1:
Introduction 


This  chapter  provides  the  backdrop  for  this  thesis.  It  not  only  defines  the  problem,  why  it  is 
important,  and  how  it  is  currently  being  solved,  but  it  also  provides  the  ground  rules. 


1.1  The  Problem 

We  consider  a  situation  where  a  searcher  attempts  to  locate  and  maintain  observation  of  a 
target  (e.g.,  terrorist,  pirate  ship,  aircraft  debris,  endangered  animal,  or  generically,  a  target 
of  interest).  We  model  the  target’s  behavior  as  a  discrete  time  Markov  Chain  (a  layman’s 
explanation of this can be found in Chapter 2, Section 2.1). The state space is the target’s
location,  activity,  or  any  specific  attributes  that  change  either  with  time  or  in  some  discrete 
or  sequential  fashion  (e.g.,  bank  account  activity).  In  particular,  the  states  can  be  physical 
locations,  radio  communication  activity,  the  IP  address  of  the  computer  used  by  the  target, 
or  even  the  target’s  current  bank  account  levels  or  some  specific  flagged  expenditures. 

In  the  simplest  scenario,  the  searcher  has  one  sensor  to  follow  the  entity  of  interest.  For 
instance, at time t = 7 the target can be in locations (or generically, states) a, b, or c. If
the  searcher,  at  t  =  7,  allocates  the  sensor  to  location  c  and  the  target  also  transitioned  to 
that  state,  then  the  searcher  earns  a  reward  of  one.  The  searcher’s  decision  variables  are 
the  sensor’s  locations  (i.e.,  states  of  the  Markov  chain)  each  time  step,  using  past  sensor 
response  information  to  guide  future  decisions. 

The  objective  of  the  searcher  (the  entity  attempting  to  follow  the  target)  is  to  allocate  the 
sensor  dynamically  so  as  to  achieve  as  close  to  constant  observation  (sensor  positively 
observing the target at each time step) as possible. If the target’s transition matrix were
known,  then  the  searcher  would  simply  position  the  sensor  on  the  state  with  the  highest 
probability  given  the  last  observed  state  of  the  target  (which  is  not  necessarily  the  last  time 
period)  and  the  sequence  of  observed  states  which  failed  to  reveal  the  target  (the  states  that 
did  not  contain  the  target  or  the  sequence  of  misses).  In  this  thesis,  we  relax  the  assumption 
of  a  known  transition  matrix.  Hence,  the  searcher  attempts  to  learn  the  underlying  Markov 
transition  matrix  of  the  target,  constant  observation  being  the  goal. 
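
The known-matrix case described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `next_sensor_state`, the three-state matrix, and its probabilities are hypothetical and do not come from the thesis. The sketch assumes a perfect sensor, so a miss rules out the searched state, and it assumes each recorded miss was actually possible (the belief never sums to zero).

```python
def next_sensor_state(P, last_seen, misses):
    """Pick the most likely state for the next time step.

    P         : known row-stochastic transition matrix (list of rows)
    last_seen : state in which the target was last observed
    misses    : states searched without success, one per step since the sighting
    """
    n = len(P)
    # Distribution of the target one step after the last sighting.
    belief = list(P[last_seen])
    for searched in misses:
        # Perfect sensor: the target was not in the searched state.
        belief[searched] = 0.0
        total = sum(belief)
        belief = [b / total for b in belief]
        # The target then moves one more step.
        belief = [sum(belief[i] * P[i][j] for i in range(n)) for j in range(n)]
    best = max(range(n), key=lambda j: belief[j])
    return best, belief


# Hypothetical three-state transition matrix (illustrative values only).
P = [[0.6, 0.3, 0.1],
     [0.2, 0.5, 0.3],
     [0.3, 0.3, 0.4]]
state, belief = next_sensor_state(P, last_seen=0, misses=[0])
print(state, belief)
```

With no misses the sensor simply goes to the largest entry of the row for the last observed state; each miss removes one state from consideration and pushes the remaining belief forward one step.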

1.2 Motivation: Why It Is Important

As  briefly  mentioned  in  Section  1.1,  our  problem  can  be  applied  to  a  very  broad  range 
of topics or settings where some form of transition dynamics needs to be learned. Here it
is  sufficient  to  highlight  a  couple  that  will  serve  as  surrogates  for  the  rest  of  the  possible 
application  areas.  Specifically,  we  will  expand  upon  our  primary  setting  of  a  terrorist  in 
a  city  as  well  as  possible  applications  to  learning  the  migratory  patterns  of  endangered 
animals  (birds  being  our  choice  of  example)  and  identifying  potential  pirates  hiding  within 
a  fleet  of  innocent  fishing  vessels. 

Suppose  there  is  a  person  of  interest  (the  target)  living  in  a  large  city  and  the  authorities  need 
to  determine  his/her  behavior  patterns  in  order  to  efficiently  (minimum  number  of  assets  and 
as  quickly  as  possible)  track  or  maintain  observation.  However,  in  a  typical  scenario  there 
exist  many  such  targets,  meaning  that  resource  limitations  preclude  persistent  surveillance. 
To  efficiently  track  the  target,  the  searcher  employs  various  sensors  depending  on  where  it  is 
believed  the  target  is  currently  located.  Another  reason  for  intermittent  search  is  to  prevent 
the  target  from  realizing  he  is  being  tracked.  Authorities  might  therefore  intentionally  limit 
the  use  of  sensors  following  the  target,  while  attempting  to  periodically  regain  observation 
of  the  target.  This  observation  might  be  having  a  human  asset  walk  by  the  front  of  a  cafe, 
glancing  inside  to  see  if  the  target  is  there  or  not.  We  model  the  target’s  behavior  as  a  discrete 
time Markov Chain with the state space comprising all the various physical locations that
the  target  might  visit.  These  might  be  static  locations  such  as  the  cafe  mentioned  above  or 
the  terrorist’s  house  but  could  also  include  more  dynamic  states  such  as  the  target  being  in 
a  vehicle  or  on  the  subway.  This  setting  will  serve  as  our  primary  motivation,  and  thus  it  is 
important  to  keep  it  in  mind  to  provide  a  framework  on  which  our  algorithms  will  be  built. 

1.2.1  Migratory  Patterns  of  Endangered  Birds 

Instead  of  attempting  to  follow  a  potential  terrorist  in  a  city  (the  person  of  interest  or  target), 
imagine  the  goal  is  to  learn  the  migratory  patterns  of  an  endangered  bird.  In  this  setting  we 
would  model  the  state  space  primarily  to  cover  such  things  as  latitude  and  longitude  (each 
state  corresponding  to  a  physical  location  and  a  small  set  of  activities).  The  searcher  would 
then  apply  our  approach  on  this  specific  flock  of  birds,  learning  over  time  the  basic  seasonal 
patterns  of  this  bird  species. 

1.2.2 A Pirate Hiding in a Fishing Fleet

In  this  setting,  which  is  much  closer  to  our  primary  scenario,  the  searcher  is  attempting  to 
learn  the  overall  behavior  pattern  of  a  fishing  fleet  in  order  to  determine  if  a  pirate  is  hiding 
in  the  fleet.  In  this  setting  the  searcher  would  replicate  our  algorithm  for  each  “fishing” 
vessel, updating these algorithms with a drone or Unmanned Aerial Vehicle (UAV) network
or  mesh.  Each  UAV  would  update  all  algorithms  based  on  which  vessels  it  can  see  during 
that  time  step. 

1.3  Current  Solution  Methods 

An important question is how this problem is currently being solved.
Essentially, if this thesis did not exist, what would people use instead? While an
answer to this question, by nature, cannot be comprehensive, it will still be beneficial to
examine  how  someone  might  attempt  to  learn  the  behaviour  pattern  of  a  terrorist  within 
a  city  without  the  algorithms  developed  in  this  thesis.  The  naive  (not  intended  here  as 
derogatory)  approach  to  following  a  target  is  to  estimate  the  transition  matrix  by  observing 
the  target’s  transitions.  For  example,  if  the  searcher  knows  the  target  is  currently  in  state  a, 
the  searcher  would  then  look  in  state  b  during  the  next  time  step.  Every  time  the  searcher 
observes  or  fails  to  observe  the  target,  more  information  is  gained.  After  accumulating 
a  number  of  hits  and  misses  (assuming  stationary  target  behaviour),  the  searcher  can  use 
the  current  time  step’s  estimated  transition  matrix  to  decide  where  to  put  the  sensor  next. 
The  naive  method  for  generating  this  estimated  transition  matrix  is  to  use  the  ratio  of  hits 
(observed  transitions)  to  total  attempted  transition  observations  as  the  probability  for  the 
target  to  make  that  specific  transition. 
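
As a minimal sketch of the naive estimator just described, the following Python class keeps a hit count and an attempt count per transition and reports their ratio. The class and method names are hypothetical, and the uniform fallback before any data is an assumption taken from the "pure exploration" remark in the next paragraph, not a detail of any specific implementation.

```python
class NaiveEstimator:
    """Hits-over-attempts estimate of each transition probability."""

    def __init__(self, n_states):
        self.n_states = n_states
        self.hits = {}      # (from_state, searched_state) -> observed transitions
        self.attempts = {}  # (from_state, searched_state) -> observation attempts

    def update(self, from_state, searched_state, hit):
        """Record one look in searched_state, one step after the target
        was known to be in from_state."""
        key = (from_state, searched_state)
        self.attempts[key] = self.attempts.get(key, 0) + 1
        if hit:
            self.hits[key] = self.hits.get(key, 0) + 1

    def estimate(self, from_state, to_state):
        key = (from_state, to_state)
        tries = self.attempts.get(key, 0)
        if tries == 0:
            # Assumption: uniform guess before any observation attempts.
            return 1.0 / self.n_states
        return self.hits.get(key, 0) / tries


est = NaiveEstimator(n_states=4)
for outcome in (True, False, False, True):
    est.update(from_state=0, searched_state=3, hit=outcome)
print(est.estimate(0, 3))
```

Note that a miss here only increases the denominator for the searched transition; it says nothing about where the target actually went.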

Initially  of  course,  the  searcher  will  not  have  a  sequence  of  hits  and  misses  for  use  in 
estimating  the  target’s  behaviour  so  will  have  to  begin  with  pure  exploration  (i.e.,  assume 
uniform  probabilities  as  the  first  estimate).  Later  on,  after  obtaining  a  sequence  of  hits  and 
misses,  placing  the  sensor  in  the  most  likely  target’s  location  (pure  exploitation)  may  result  in 
poor  performance.  A  more  sophisticated  approach  would  include  a  term  to  force  exploration 
of  the  whole  state  space  (so  that  learning  takes  place).  This  and  other  approaches  are  fleshed 
out  in  Chapter  3:  Methodology.  We  develop  an  algorithm  that  includes  the  sequence  of  hits 
and  misses  plus  a  judiciously  chosen  inflation  term  to  force  exploration  that  consistently 
learns  faster. 

1.4 Our Approach

Our  goal  is  to  develop  an  algorithm  that  learns  faster  than  the  naive  approach.  Specifically, 
we  apply  methods  from  the  following  fields  of  research:  Applied  Probability,  Optimization, 
Online  Learning,  and  Statistics.  We  start  with  an  ideal  situation  where  we  not  only  know 
the  transition  dynamics  (or  probabilities)  but  also  have  some  form  of  Oracle  which  provides 
us  the  actual  location  or  state  the  target  transitioned  to  if  we  miss  it.  We  then  sequentially 
relax  the  assumptions  for  this  situation  until  we  get  to  the  point  where  we  do  not  know  the 
transition  dynamics  at  all  and  do  not  have  an  Oracle.  This  of  course  is  more  fully  discussed 
in  Chapter  3:  Methodology. 

Our  algorithm  leverages  not  only  the  ratio  of  hits  over  total  observation  attempts  (hits  and 
misses)  for  a  single  transition  probability  but  also  the  fact  that  each  miss  contains  quite  a 
lot  of  information  regarding  other  transitions.  This  additional  information  comes  from  the 
assumption  that  we  designed  the  state  space  to  be  comprehensive.  In  other  words,  if  the 
target  is  currently  at  state  a,  it  must  either  stay  in  that  state  or  transition  to  another  state 
within  the  state  space.  We  assume  that  the  target  cannot  jump  out  of  the  state  space  (i.e., 
we  have  defined  the  state  space  to  be  exhaustive). 

Therefore,  in  the  situation  just  described,  if  the  searcher  allocates  the  sensor  to  look  in 
state  a  again  but  the  target  is  not  there,  we  can  distribute  the  weight  of  this  miss  (in  the 
naive  approach  this  means  just  increasing  the  denominator  of  the  ratio;  in  our  approach  this 
weighting  is  calculated  by  an  optimization  problem)  across  all  the  rest  of  the  states  (as  a 
miss for the a to a transition but as a partial hit for the a to x transition, x being all non-a
states). 

Intuitively,  the  two-state  example  is  the  easiest  to  see.  If  our  state  space  is  {a,  b}  and  the 
target  is  currently  in  state  a,  then  if  it  does  not  transition  back  to  a  we  know  implicitly 
that  it  must  have  transitioned  to  b.  Therefore,  instead  of  just  updating  the  a  to  a  transition 
probability  with  a  miss,  we  also  update  the  a  to  b  transition  probability  with  a  hit.  In  the 
two-state  setting  a  miss  provides  just  as  much  information  as  a  hit. 
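
The two-state intuition can be sketched as follows. This illustrates only the two-state special case, where a miss identifies the target's state exactly; it is not the Single-Miss MLE itself, which for larger state spaces distributes the weight of a miss through an optimization problem. The function names and the count layout are hypothetical.

```python
def two_state_update(counts, from_state, searched, hit):
    """Update transition counts for a two-state chain with states 0 and 1.

    counts[i][j] is the number of i -> j transitions known to have occurred.
    A miss in `searched` implies the target moved to the other state.
    """
    actual = searched if hit else 1 - searched
    counts[from_state][actual] += 1
    return counts


def estimate_row(counts, from_state):
    """Estimated transition probabilities out of from_state."""
    total = sum(counts[from_state])
    if total == 0:
        return [0.5, 0.5]  # assumption: uniform guess before any data
    return [c / total for c in counts[from_state]]


counts = [[0, 0], [0, 0]]
two_state_update(counts, from_state=0, searched=0, hit=False)  # counts as 0 -> 1
two_state_update(counts, from_state=0, searched=0, hit=True)   # counts as 0 -> 0
two_state_update(counts, from_state=0, searched=0, hit=False)  # counts as 0 -> 1
print(estimate_row(counts, 0))
```

Every observation attempt, hit or miss, adds exactly one fully informative count, which is the sense in which a miss is worth as much as a hit in this setting.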




1.5  Measures  of  Success 

How  will  we  know  if  we  have  succeeded?  We  measure  our  success  by  our  algorithm’s 
expected  regret  as  compared  to  the  naive  approach  mentioned  above  in  Section  1.3.  We 
define  expected  regret  as  the  cumulative  difference  between  our  estimate  of  the  target’s 
transition  probability  and  the  true  probability  over  time.  Further,  we  seek  to  provide  upper 
bounds  (i.e.,  worst  case  bounds)  on  the  performance  of  our  algorithm,  namely  on  the 
expected  regret  growth  over  time.  This  is  explained  in  more  detail  in  Chapters  4  and  5. 

1.6  Ground  Rules:  Scope,  Limitations,  and  Assumptions 

Before  going  any  further,  it  will  be  helpful  to  lay  out  a  scope  for  this  thesis,  some  limitations, 
and  a  few  assumptions  we  are  making  in  our  approach.  In  this  thesis  we  develop  an  algorithm 
that  utilizes  and  optimizes  over  the  one-step  misses  (hence,  the  name  “Single-Miss  MLE”) 
returned  from  a  sensor  which  only  provides  binary  responses,  as  mentioned  in  Section  1.2. 
We  define  a  one-step  miss  as  the  situation  where  the  searcher  knows  the  location  of  the 
target  in  the  previous  time  step  but  missed  it  in  the  current  time  step.  If  the  searcher  missed 
it  again  in  the  next  time  step,  that  would  be  a  two-step  miss. 

Additionally,  we  assume  that  the  state  space  is  comprehensive  (i.e.,  the  target  cannot  “jump” 
out  of  the  state  space);  we  only  examine  sensors  that  have  neither  false  positive  nor  false 
negative  rates;  we  only  consider  discrete  time;  we  assume  that  the  target’s  behavior  is 
stationary  (i.e.,  the  transition  matrix  doesn’t  change  over  time);  we  only  consider  one  target 
and  one  sensor;  and  we  assume  that  the  sensor  is  unobserved  by  the  target.  Relaxing  any 
of  these  last  assumptions  or  limitations  would  make  very  good  extensions  and  we  hope  to 
explore  them  in  future  work. 

CHAPTER 2:

Background  and  Literature  Review 


The  purpose  of  this  chapter  is  to  orient  the  reader  to  where  this  thesis  research  fits  into 
the  broader  discipline  of  Operations  Research  as  well  as  provide  a  common  background  of 
concepts  that  we  will  leverage  and  build  upon.  The  topics  or  areas  of  research  that  intersect 
with our problem are: Persistent Surveillance, Search and Detection Theory, Markov Chains
from Probability Theory, and Online Learning, specifically the Multi-Armed Bandit (MAB)
approach  to  Machine  Learning.  The  first  two  are  parallel  efforts  that  we  wish  to  assist  by 
attacking  our  problem  from  an  Online  Learning  approach  using  the  setting  of  a  Markov 
Chain  that  will  provide  a  more  general  range  of  settings  than  usually  seen. 


2.1  Markov  Chains 

In  this  section  we  refresh  the  reader  on  some  of  the  basics  regarding  Markov  Chain  theory. 
This  is  necessary  as  we  intend  to  leverage  the  power  and  flexibility  of  the  Markovian  approach 
enabling  us  to  effectively  model  a  wide  range  of  real-world  search  and  observation  problems. 
Much  of  this  material  is  a  summary  from  the  incredibly  useful  “Probability  Models  for 
Practitioners”  [1]  class  notes  written  by  Professor  Kyle  Lin  from  the  Naval  Postgraduate 
School  (NPS). 

Here  we  take  a  moment  to  define  a  Markov  Chain  for  the  reader.  A  discrete  time  Markov 
Chain is a sequence of random variables $X_1, X_2, \ldots$, indexed by time and taking values in some
state space, with the property that future values $X_{t+1}, X_{t+2}, \ldots$ depend only on the
current state $X_t = x$, and are therefore conditionally independent of the past. We call this
independence  the  Markov  property.  While  this  may  seem  like  a  rather  large  assumption 
to  make,  if  needed,  we  can  embed  information  about  the  past  into  the  current  state  thereby 
maintaining  this  assumption  without  losing  the  mathematical  power  of  the  memoryless 
property  of  Markov  Chains. 

More formally, a discrete time Markov chain is a stochastic process $(X_t : t = 1, 2, \ldots)$
taking values in a discrete state space $\mathcal{S} = \{1, 2, \ldots, S\}$ that satisfies the Markov property,
meaning that

$$P(X_{t+1} \in A \mid X_1, \ldots, X_t) = P(X_{t+1} \in A \mid X_t),$$

for $A \subset \mathcal{S}$, so that the distribution of $X_t$ depends on the past only through $X_{t-1}$. Appendix
B provides a simple example of a discrete time Markov Chain.
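
To make the definition concrete, here is a small simulation sketch in Python. The four-state matrix is purely hypothetical (the thesis does not list the probabilities behind its four-state example), and the function name `simulate_chain` is illustrative. Each step draws the next state using only the row of the current state, which is exactly the Markov property.

```python
import random


def simulate_chain(P, start, n_steps, seed=0):
    """Simulate a discrete time Markov chain.

    P       : row-stochastic transition matrix (list of rows)
    start   : initial state index
    n_steps : number of transitions to simulate
    """
    rng = random.Random(seed)
    path = [start]
    for _ in range(n_steps):
        row = P[path[-1]]  # the next state depends only on the current state
        nxt = rng.choices(range(len(row)), weights=row)[0]
        path.append(nxt)
    return path


# Hypothetical four-state transition matrix (illustrative values only).
P = [[0.1, 0.3, 0.2, 0.4],
     [0.5, 0.2, 0.2, 0.1],
     [0.3, 0.3, 0.3, 0.1],
     [0.6, 0.1, 0.1, 0.2]]
print(simulate_chain(P, start=0, n_steps=10))
```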

2.2  Machine  Learning 

In  this  section,  we  delve  into  the  topic  of  Machine  Learning,  specifically  its  sub-discipline, 
Online  Learning.  We  give  an  overview  of  Online  Learning  and  then  delve  into  some  specific 
methods  that  we  use  in  this  thesis  including  the  MAB  Problem,  Thompson  Sampling  (TS), 
the  classical  Maximum  Likelihood  Estimation  (MLE)  Point  Estimation  method,  and  finally 
the  Upper  Confidence  Bound  (UCB)  algorithm  for  the  Stochastic  MAB  problem.  We 
include  a  short  discussion  of  MLE  because  it  is  a  critical  component  in  our  approach  to 
estimating  a  given  probability  based  on  a  sequence  of  data. 

2.2.1  Online  Learning  Overview 

Online  Learning  or  Online  Convex  Optimization  (OCO)  is  a  sub-discipline  of  Machine 
Learning  under  which  it  was  first  defined.  Hence,  it  primarily  studies  the  performance  of 
learning  algorithms.  As  indicated  by  the  second  title,  at  heart  it  is  optimization  within  a 
dynamic setting vice the standard deterministic setting and therefore has very broad applicability.
As succinctly stated by Hazan, in his Introduction to Online Convex Optimization [2]:


In  many  practical  applications  the  environment  is  so  complex  that  it  is  infeasible 
to  lay  out  a  comprehensive  theoretical  model  and  use  classical  algorithmic 
theory  and  mathematical  optimization.  It  is  necessary  as  well  as  beneficial  to 
take  a  robust  approach:  apply  an  optimization  method  that  learns  as  one  goes 
along,  learning  from  experience  as  more  aspects  of  the  problem  are  observed. 

Of note, this conceptualization blurs the classic definitions or boundaries of deterministic
modeling,  stochastic  modeling,  and  optimization  methodologies  [2].  As  can  be  seen  from 
the  dynamic  setting,  the  algorithm  doesn’t  have  all  of  the  information  in  the  beginning  but 
still  must  act. 

The  basic  framework,  from  Hazan  [2],  is  that  a  player  (the  searcher  in  our  setting)  makes 
iterative decisions while online (think making decisions with partial information and additional
information arriving as a stream). The underlying game structure is explained
later in this section. When the player makes each decision, the outcomes (think penalty or
loss) associated with those possible choices are unknown. Once the player commits to a
choice, he/she incurs the loss associated with that specific choice. Unlike in
Dynamic Programming, the player does not know in advance the losses associated with a
given decision. While these losses can depend on the player’s choices,
they could also be assigned by an adversary or opponent. The following restrictions must
therefore  be  imposed  to  make  this  framework  feasible  and  complete: 

1. The losses associated with the set of choices must be bounded. Otherwise, the adversary
could decrease the scale of losses each iteration such that the player (algorithm)
would  never  recover  from  the  first  loss.  Therefore,  the  losses  must  reside  within  a 
bounded  region. 

2.  The  set  of  decisions  facing  the  player  or  algorithm  must  be  somehow  bounded  or 
structured.  This  ensures  that  we  have  some  sort  of  meaningful  performance  metric 
and  prevents  the  adversary  from  assigning  large  losses  to  each  choice  made  by  the 
player  (algorithm)  indefinitely  while  separating  a  set  of  strategies  that  have  no  loss. 

Essentially,  this  framework  can  be  viewed  as  a  structured  and  repeated  game  [2].  The 
following notation helps solidify this framework. The set of decisions is a convex, nonempty,
bounded, and closed set in Euclidean space, $\mathcal{K} \subseteq \mathbb{R}^n$, with the costs being modeled as
bounded convex functions over $\mathcal{K}$ [2]. At iteration $t \in \{1, \ldots, T\}$, $T$ being the total number of game
iterations, the player chooses a decision $x_t \in \mathcal{K}$. After committing to a choice, the
associated cost function is revealed: $f_t \in \mathcal{F} : \mathcal{K} \to \mathbb{R}$. Hazan further defines
$\mathcal{F}$ as being the bounded family of cost functions available to the adversary. The cost that
the player must now pay is $f_t(x_t)$. Our performance metric, taken from game theory, is the
regret: the difference between the total cost actually incurred by the player over all iterations and
the total cost of the best fixed decision in hindsight [2]. To
formally define regret, consider an OCO algorithm, $\mathcal{A}$, that maps an online player’s game
history to a specific decision in the set of decisions over time [2]. The regret of algorithm
$\mathcal{A}$ after $T$ iterations is formally defined by Hazan in [2], Equation 1.1, as:


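$$\text{regret}_T(\mathcal{A}) = \mathcal{R}_T(\mathcal{A}) = \sup_{\{f_1, \ldots, f_T\} \subseteq \mathcal{F}} \left\{ \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t(x) \right\} \tag{2.1}$$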


From  this  definition  of  regret,  we  see  that  it  is  desirable  for  the  regret  to  be  sub-linear  as 
a  function  of  time  or  T.  This  setting  or  framework  for  online  learning  has  become  very 
popular  recently  primarily  due  to  its  powerful  modeling  capabilities  [2].  Specifically,  it  can 
be  used  to  model  such  diverse  problems  as  online  routing,  advert  placement  and  selection, 
and  even  spam  filtering  [2]. 
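
As a small worked illustration of regret, the sketch below evaluates the quantity inside the supremum of Equation 2.1 for one fixed sequence of cost functions. Everything specific here is an assumption chosen for simplicity rather than taken from Hazan or from this thesis: the decision set is the interval [0, 1], the costs are quadratic, $f_t(x) = (x - z_t)^2$ with each $z_t$ in [0, 1], and the function name `regret` is hypothetical. For these costs the best fixed decision in hindsight is the mean of the $z_t$.

```python
def regret(decisions, targets):
    """Regret of a played sequence against the best fixed decision in hindsight.

    decisions : the player's choices x_1, ..., x_T in [0, 1]
    targets   : z_1, ..., z_T in [0, 1], defining costs f_t(x) = (x - z_t)**2
    """
    incurred = sum((x - z) ** 2 for x, z in zip(decisions, targets))
    # The sum of these quadratics is minimized at the mean of the targets,
    # which lies in [0, 1] because every target does.
    best_fixed = sum(targets) / len(targets)
    best_cost = sum((best_fixed - z) ** 2 for z in targets)
    return incurred - best_cost


targets = [0.0, 1.0]
print(regret([0.5, 0.5], targets))  # the player matched the best fixed decision
print(regret([0.0, 0.0], targets))  # the player never adapted
```

The formal definition takes the worst case of this quantity over all admissible cost sequences; the sketch only evaluates it for the single sequence supplied.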


2.2.2  The  Multi-Armed  Bandit  (MAB)  Problem 

In  this  sub-section  we  delve  further  into  the  Online  Learning  framework  with  a  specific,  and 
rather popular variant, the Stochastic MAB. The Stochastic MAB is named after the “Single-Arm
Bandit,” a Vegas slot machine, which still serves as one of the best ways to describe
this  problem.  Of  note  up  front,  much  of  the  material  from  this  section  is  a  summary  of  or 
taken  directly  from  Mahajan  and  Teneketzis’  Multi-armed  Bandit  Problems  [3]  and  Agrawal 
and  Goyal’s  Analysis  of  Thompson  Sampling  for  the  Multi-Armed  Bandit  Problem  [4]. 

Imagine  a  player  entering  a  casino  and  purchasing  20  “tokens”  with  which  to  play  a  row 
of  “Single-Arm  Bandits.”  We  call  this  row  of  slot  machines  a  “Multi-Armed  Bandit”  with 
each  slot  machine’s  lever  or  handle  being  an  “Arm”  of  this  MAB.  Further,  we  assume 
that  each  slot  machine  or  arm  has  the  potential  to  have  a  payout  or  reward  (based  upon  a 
potentially  different  underlying  probability  distribution  for  each  arm)  and  each  arm’s  reward 
is a realization or sample from that distribution. Since the player does not initially know
which arm will give the best payout (i.e., has the most lucrative reward distribution), the
player must begin by exploring the system, using a few tokens to compare the arms. At this
point, the player has very rough estimates on the potential rewards from a few arms and
must decide to either exploit the best arm so far or continue to explore the rest of the arms
with  the  finite  tokens.  This  tension  between  exploration  and  exploitation  is  the  heart  of  the 
MAB  Family  of  Problems. 

Another  way  of  thinking  about  this  tension  or  describing  this  problem  is  to  imagine  a  player 
who, at each time step, has a single resource to allocate to one of a number of
competing  projects.  Upon  allocating  the  resource,  that  project  changes  while  the  rest  stay 
static.  Further,  the  reward  or  return  on  investment  from  that  project  is  different  than  what 
each  of  the  other  projects  might  have  returned.  Hence,  the  MAB  problems  are  a  class  of 
sequential  resource  allocation  problems  [3].  Further,  most  MAB  algorithms  use  the  player’s 
regret  as  their  measure  of  effectiveness.  This  regret  is  essentially  the  same  as  that  defined 
above in the Online Learning section. It is defined as the difference between the reward of the “best” arm
that the player could have pulled that time step, had he/she known all of the distributions, and
that of the arm he/she actually pulled.

There  are  of  course  many  variants  of  this  problem  or  ways  to  adjust  the  classic  setup  to  fit 
numerous  real-world  situations  such  as:  one  or  multiple  resources  available  for  allocation, 
new  projects  appearing  over  time,  all  projects  changing  each  time  step,  or  even  an  adversary 
who  chooses  the  rewards  of  each  arm. 


We base our formulation of the MAB problem on Agrawal and Goyal [4]. Consider a casino
with $\mathcal{K}$ slot machines, each of these “arms” denoted by $i \in \mathcal{K}$. At each discrete time step,
$n = 1, 2, 3, \ldots$, the player must decide which of the $\mathcal{K}$ arms to pull. Each arm $i$ returns
a random, real-valued reward with support on [0, 1]. The rewards returned from
each arm, immediately after pulling that arm, are independent and identically distributed as
well as independent of the play of the other arms. Therefore, the player or MAB algorithm
must decide which arm to pull, at time $n$, based on the rewards received (or in other words,
the information obtained) up through time $n - 1$. Next, we define $\mu_i$ as the (unknown)
expected or average reward for arm $i$, $i(n)$ as the specific arm played at time step $n$, and
$\mu_{i(n)}$ as the (unknown) expected or average reward for the arm $i$ pulled at $n$.


The goal, therefore, is to maximize the total expected reward by time $N$. We denote this
expected total reward by $\mathbb{E}\left[\sum_{n=1}^{N} \mu_{i(n)}\right]$. Since this measure doesn’t really tell us how well
our algorithm performs over time (due to no comparison with how well we could have done),
we instead use the equivalent measure of expected total regret. This regret, as mentioned
above, is the difference between the optimal arm and $i(n)$, the arm we played. To define this
regret, $\mathcal{R}$, let $\mu^* := \max_i \mu_i$, $\Delta_i := \mu^* - \mu_i$ (which is always greater than or equal to zero by
definition), and further, let $k_i(n)$ be the number of times the algorithm has pulled arm $i$ by
time $n$. Finally, we formally define this regret as follows:


r  N 


E[ft(iV)]  =E 


L  n=  1 


=^A,-E[fc,(A0] 


(2.2) 
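The two forms of the regret in Equation 2.2 can be checked numerically on a single realized sequence of plays. The sketch below is illustrative only: the arm means, the play sequence, and the function names are assumptions made for this example, and it computes the regret of one sample path rather than the expectation that appears in the equation.

```python
def regret_from_plays(means, plays):
    """Sum over time steps of (mu* - mu_{i(n)}) for one realized play sequence."""
    best = max(means)
    return sum(best - means[arm] for arm in plays)


def regret_from_counts(means, counts):
    """Sum over arms of Delta_i * k_i(N), the right-hand form of the regret."""
    best = max(means)
    return sum((best - mu) * k for mu, k in zip(means, counts))


# Hypothetical arm means and a hypothetical sequence of pulled arms (0-based).
means = [0.2, 0.5, 0.8]
plays = [0, 1, 2, 2, 1, 2, 2, 2]
counts = [plays.count(arm) for arm in range(len(means))]
```

Both functions return the same value for the same play sequence, which is the identity that lets the regret be analyzed arm by arm through the pull counts.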


2.2.3  Thompson  Sampling  (TS) 

We now examine the basic TS algorithm for the Bernoulli Bandit problem. This section’s
material and notation (which does not follow that of previous sections) is from Agrawal
and Goyal’s Analysis of Thompson Sampling for the Multi-Armed Bandit Problem [4].

First, we examine a special case of the above Stochastic MAB problem in which the result
of pulling an arm is Bernoulli, or binary, i.e., hit or miss, success or failure. We model this
response as simply 0 or 1. So, for arm $i$, the probability of success, or of getting the reward
(reward = 1), is $\mu_i$. This special case is called the Bernoulli Bandit problem. The TS
algorithm for this problem uses Bayesian priors on the Bernoulli means, the $\mu_i$’s, for which
Agrawal and Goyal propose the Beta family of distributions. They propose this because the
Betas have support on the interval (0,1), are continuous probability distributions, and also
enable a very natural posterior update; in other words, the Beta and Bernoulli distributions
form a conjugate prior structure. The following is a quick summary of Beta distributions,
followed by an exploration of the proposed algorithm.

The Beta family of distributions, as mentioned above, are continuous probability distributions
with support on the interval (0,1). Below is a plot of some of these distributions for
various ranges of the parameters. Their Probability Density Function (PDF), Beta($\alpha$, $\beta$),
with parameters $\alpha > 0$ and $\beta > 0$, is given by
$f(x; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha - 1}(1 - x)^{\beta - 1}$, and
their mean is given by $\mu_{\text{Beta}} = \frac{\alpha}{\alpha + \beta}$ [4]. The higher the
$\alpha$ and $\beta$’s are, the lower the variance. You can see this decreased variance in Figure 2.1,
specifically, the gold distribution as compared to, say, the light blue one.
[Figure: probability density functions (PDFs) of the Beta family of distributions]
Figure 2.1. The Family of Beta Distributions
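To make the link between the parameters and the spread of a Beta distribution concrete, the following minimal sketch evaluates the PDF and mean given above, together with the standard closed-form variance of a Beta distribution, $\alpha\beta / ((\alpha + \beta)^2 (\alpha + \beta + 1))$. The function names are chosen for this illustration and are not taken from [4].

```python
from math import gamma


def beta_pdf(x, a, b):
    """Density of Beta(a, b) at a point x in (0, 1)."""
    return gamma(a + b) / (gamma(a) * gamma(b)) * x ** (a - 1) * (1 - x) ** (b - 1)


def beta_mean(a, b):
    """Mean of Beta(a, b)."""
    return a / (a + b)


def beta_variance(a, b):
    """Variance of Beta(a, b); it shrinks as a + b grows with the mean held fixed."""
    return a * b / ((a + b) ** 2 * (a + b + 1))
```

For example, Beta(2, 2) and Beta(20, 20) share the mean 0.5, but the second has a much smaller variance, which matches the narrowing of the curves as the parameters increase.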

The basic TS algorithm uses the red distribution in Figure 2.1 as the prior for each arm.
Specifically, it assumes that arm $i$ has a prior Beta(1,1) on $\mu_i$. This is a natural choice, as
Beta(1,1) is the uniform distribution on the interval (0, 1) [4]. Of note, the following
notation breaks from that defined in previous sections in order to more closely follow [4].
Next, at time $t$, having observed $S_i(t)$ successes or hits (think, reward of 1) and $F_i(t)$ failures
or misses (think, reward of 0) in $k_i(t) = S_i(t) + F_i(t)$ plays of arm $i$, the algorithm updates
the current distribution of $\mu_i$ to Beta($S_i(t) + 1$, $F_i(t) + 1$). Lastly, the algorithm draws a
sample from each of these posterior distributions for the means of the arms, the $\mu_i$’s, and
plays the arm with the largest sampled mean, in other words, the arm whose sample indicates
the highest probability of a success. This method from Agrawal and Goyal [4] is summarized
in Algorithm 1, found in Table 2.1.
Algorithm 1: Basic Thompson Sampling for Bernoulli bandits
For each arm i = 1, 2, ..., N set S_i = 0, F_i = 0
for t = 1, 2, ... do
    for each arm i = 1, ..., N do
        Sample θ_i(t) from the Beta(S_i(t) + 1, F_i(t) + 1) distribution.
    end
    Play arm i(t) := argmax_i θ_i(t) and observe reward r_t.
    if r_t = 1 then
        Set S_i(t) = S_i(t) + 1
    else
        Set F_i(t) = F_i(t) + 1
    end
end

Table 2.1. Algorithm: Basic Thompson Sampling
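As an illustration of Algorithm 1, the following is a minimal Python sketch of basic Thompson Sampling for Bernoulli arms. It is not the implementation used in this thesis or in [4]: the function name, the simulated arm means, the fixed horizon, and the seed are all assumptions made so that the example is self-contained and repeatable.

```python
import random


def thompson_sampling(true_means, horizon, seed=0):
    """Simulate basic Thompson Sampling on Bernoulli arms.

    true_means holds the (normally unknown) success probabilities and is used
    here only to simulate rewards. Returns per-arm successes, per-arm failures,
    and the list of arms played.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    successes = [0] * n_arms  # S_i
    failures = [0] * n_arms   # F_i
    plays = []
    for _ in range(horizon):
        # Draw one sample per arm from its Beta(S_i + 1, F_i + 1) posterior.
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                   for i in range(n_arms)]
        # Play the arm with the largest sampled mean.
        arm = max(range(n_arms), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_means[arm] else 0
        if reward == 1:
            successes[arm] += 1
        else:
            failures[arm] += 1
        plays.append(arm)
    return successes, failures, plays
```

Calling `thompson_sampling([0.1, 0.5, 0.9], 2000)` should concentrate most plays on the last arm, since its posterior settles near 0.9 while the other posteriors settle near lower values.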

The basic idea behind this algorithm is that at each iteration or time step, the TS algorithm
attempts to pull the arm with the largest probability of returning a reward of 1. The intuition
here is that each reward of 1 increments the $\alpha$ parameter of the associated Beta distribution,
shifting it closer to one, while each reward of zero increments the $\beta$ parameter of the
associated distribution, shifting it closer to zero. To see this graphically, look at the light
blue and purple distributions in Figure 2.1. Additionally, as the number of samples increases,
the variance of the resultant Beta distribution decreases, as mentioned above. This can also
be seen in Figure 2.1, specifically, the gold distribution as compared to the light blue one.
Because the posterior of a rarely played arm stays wide, that arm still occasionally produces
the largest sample and is explored, while the posteriors of frequently played arms concentrate
around their true means; this is why the algorithm performs well (specific regret bounds for
this algorithm can be found in Agrawal and Goyal’s paper [4]). Agrawal and Goyal then
extend this algorithm to the general stochastic MAB setting (see [4] for more information).
2.2.4 Maximum Likelihood Estimation (MLE)

Here, we take a moment to examine a very useful method for estimating a parameter of a
distribution. In our case, we are trying to estimate an unknown transition probability from
the underlying Markov Chain’s transition probability matrix. The following summary is
based on Jay Devore’s textbook, Probability and Statistics for Engineering and the
Sciences [5], as well as Erich Lehmann and George Casella’s seminal text, Theory of Point
Estimation [6].

First introduced by R. A. Fisher between 1912 and 1922, the method of MLE, as its name
implies, attempts to accurately estimate a distribution’s parameter(s) of interest given only
a finite number of samples from that distribution. Specifically, the likelihood function tells
us how likely the observed samples are as a function of the possible parameter values.
Then, by maximizing the likelihood function, the MLE method returns the parameter values
under which the observed data were most likely to have been generated [5]. The following
definition is taken directly from Devore [5]:


Definition:  Maximum  Likelihood  Estimator 


Let $X_1, X_2, \ldots, X_n$ have a joint probability mass function or PDF of

$$f(x_1, x_2, \ldots, x_n \mid \theta_1, \theta_2, \ldots, \theta_m) \qquad (2.3)$$

where the parameters $\theta_1, \theta_2, \ldots, \theta_m$ have unknown values. When $x_1, x_2, \ldots, x_n$ are
the observed sample values and (2.3) is regarded as a function of $\theta_1, \theta_2, \ldots, \theta_m$, it is
called the likelihood function. The maximum likelihood estimates $\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_m$ are
those values of the $\theta_i$’s that maximize the likelihood function, so that

$$f(x_1, \ldots, x_n \mid \hat{\theta}_1, \ldots, \hat{\theta}_m) \geq f(x_1, \ldots, x_n \mid \theta_1, \ldots, \theta_m) \quad \text{for all } \theta_1, \ldots, \theta_m \qquad (2.4)$$

When the $X_i$’s are substituted in place of the $x_i$’s, the maximum likelihood estimators result.

Table  2.2.  Definition:  Maximum  Likelihood  Estimator.  Reproduced  from 
Devore’s  Probability  and  Statistics  for  Engineering  and  the  Sciences  [5] 
But why use the MLE method instead of some other option? As briefly mentioned by
Devore in [5] and closely examined and proved by Lehmann in [6], the MLE method has
some very useful and important properties that make it a good choice. These properties are:

1. Asymptotic Consistency

Under many conditions, MLEs converge in probability to the true parameter value,
$\theta$. Further, by increasing our sample size, $n$, we can also achieve an arbitrary level of
precision [6].

2. Asymptotic Efficiency

This means that as the sample size $n$ increases (tends towards infinity), under certain
conditions, the MLE converges to the true parameter value, $\theta$, as fast as the quickest
possible method. In other words, this method converges as quickly as theoretically
possible. It asymptotically achieves the so-called Cramér-Rao lower bound, which
means that no consistent estimator can converge more quickly [6], [7], [8]. While other
consistent estimators may match an MLE in convergence rate, they are not able to beat it.

3. Asymptotic Normality

Again, as $n$ increases, the MLEs converge in distribution, under certain conditions,
to a Gaussian (normal) distribution with mean equal to the true parameter value, $\theta$,
and a minimal variance [6], which, according to Devore, is “as small as or nearly as
small as can be achieved by any estimator” [5].
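As a small worked illustration of the definition above (not taken from [5] or [6]), consider estimating a single success probability $p$ from $n$ independent Bernoulli observations, of which $h$ are successes. The symbols $p$, $h$, and $n$ are chosen here for illustration only.

```latex
\begin{align*}
L(p) &= p^{h}(1-p)^{n-h}, \\
\ln L(p) &= h \ln p + (n-h)\ln(1-p), \\
\frac{d}{dp}\ln L(p) &= \frac{h}{p} - \frac{n-h}{1-p} = 0
  \quad\Longrightarrow\quad \hat{p} = \frac{h}{n}.
\end{align*}
```

The maximizer is simply the observed fraction of successes. For $0 < h < n$ the log-likelihood is strictly concave in $p$, so this stationary point is the unique maximum.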
2.2.5 Upper Confidence Bound (UCB) Strategies

This  section  covers  a  critical  strategy  for  developing  theoretical  upper-bounds  on  the  MAB 
convergence  rate  or  expected  regret  in  a  specific  setting.  This  section’s  material  and  notation 
closely  follows  Bubeck  and  Cesa-Bianchi’s  Regret  Analysis  of  Stochastic  and  Nonstochastic 
Multi-armed  Bandit  Problems  [9]  as  well  as  loosely  following  Elad  Hazan’s  Introduction 
to  Online  Convex  Optimization  [2]. 

In order to examine this topic, it is necessary to start with some basic definitions from the
Stochastic Bandit problem. From Bubeck and Cesa-Bianchi [9], each arm $i \in \{1, \ldots, K\}$
is tied to an unknown probability distribution $\nu_i$. The player, at each time step $t = 1, \ldots, n$,
picks an arm $I_t \in \{1, \ldots, K\}$ and receives a reward $X_{I_t,t}$ drawn from the probability
distribution $\nu_{I_t}$, independently of the past. Further, we denote the mean of arm $i$ by $\mu_i$
and then define the optimal mean and arm below, which are unknown to the player [9]:

$$\mu^* = \max_{i=1,\ldots,K} \mu_i \quad \text{and} \quad i^* \in \operatorname*{argmax}_{i=1,\ldots,K} \mu_i$$

Next, we define pseudo-regret following Bubeck and Cesa-Bianchi [9] as:

$$R_n = n\mu^* - \mathbb{E}\sum_{t=1}^{n} \mu_{I_t} \qquad (2.5)$$

If $i^*$ were known, then the agent would simply pull that arm in each iteration in order to
minimize the pseudo-regret. Of note, the $\mu_i$’s in Equation 2.5 are the actual means of
those arms’ distributions, which are unknown to the agent.

We let $T_i(s) = \sum_{t=1}^{s} \mathbb{1}_{I_t = i}$ denote the number of times that the player has selected arm $i$
within the first $s$ time steps, and we let $\Delta_i = \mu^* - \mu_i$ denote the sub-optimality parameter
of arm $i$ (the regret incurred each time this arm is pulled, since its mean reward is smaller
than that of the optimal arm). The idea is that pulling an arm with large $\Delta_i$ induces a large
pseudo-regret. Of course, the agent doesn’t know the values of the $\Delta_i$, but they exist and are
well defined as long as the mean rewards are finite.
Following Bubeck and Cesa-Bianchi in [9], we assume that the distribution of rewards $X$
is light-tailed and hence, there exists a convex function $\psi$ on the real numbers such that,
for all $\lambda \geq 0$,

$$\ln \mathbb{E}\, e^{\lambda(X - \mathbb{E}[X])} \leq \psi(\lambda) \quad \text{and} \quad \ln \mathbb{E}\, e^{\lambda(\mathbb{E}[X] - X)} \leq \psi(\lambda) \qquad (2.6)$$

Following Bubeck and Cesa-Bianchi [9] further, the UCB algorithm can be applied to
any light-tailed distribution meeting the conditions defined in Equation 2.6 by forming
an index for each arm and then pulling the arm with the largest index estimated so far.
Each index has two components. The first is the sample mean of the rewards obtained by
pulling arm $i$ a total of $s$ times, $\hat{\mu}_{i,s} = \frac{1}{s}\sum_{t=1}^{s} X_{i,t}$. The second is an inflation term
(this is the “upper confidence bound” part) selected so that the probability of a suboptimal
index being larger than the optimal index is suitably small. In the following index, $\alpha$ is any
positive constant; the calculation of its optimal value is examined in further detail in [9].
Also, from convex analysis, we denote by $\psi^*(\epsilon)$ the Legendre-Fenchel transform of the
function $\psi$: $\psi^*(\epsilon) = \sup_{\lambda \in \mathbb{R}} (\lambda\epsilon - \psi(\lambda))$, where, for example, if $\psi(x) = e^x$ then
$\psi^*(x) = x\ln(x) - x$, $\forall x > 0$. Our index then, at time $t$, is defined as

$$\hat{\mu}_{i,T_i(t-1)} + (\psi^*)^{-1}\!\left(\frac{\alpha \ln t}{T_i(t-1)}\right)$$

for each arm $i$, where $\psi^*(\cdot)$ is the large deviations rate function corresponding to distribution
$\nu_i$, and $T_i(t-1)$ is the number of pulls of arm $i$ by time $t - 1$. With this, one can show that


$$\mathbb{P}\left(\hat{\mu}_{i,T_i(t-1)} - (\psi^*)^{-1}\!\left(\frac{\alpha \ln t}{T_i(t-1)}\right) > \mu_i\right) \leq t^{1-\alpha}$$

which implies that the regret grows similarly to $\log t$.


The question remains, though: what is the function $\psi^*(\cdot)$? If $\nu_i$ has bounded support (the
relevant scenario for this thesis) over $(-b, b)$, for $0 < b < \infty$, then, as Bubeck and
Cesa-Bianchi mention in [9], one can use $\psi(\lambda) = \frac{\lambda^2 b^2}{2}$. The Gaussian case, important
for several applications, leads to $\psi(\lambda) = \frac{\sigma^2 \lambda^2}{2}$, where $\sigma^2$ is the variance. It is
possible to get suitable bounds for $\psi^*(\cdot)$ under further assumptions, but this lies outside
the scope of this thesis.
In summary, the algorithm pulls arm

$$I_t \in \operatorname*{argmax}_{i=1,\ldots,K} \left[\hat{\mu}_{i,T_i(t-1)} + (\psi^*)^{-1}\!\left(\frac{\alpha \ln t}{T_i(t-1)}\right)\right]$$

at time $t$. By using this strategy the algorithm achieves the following regret upper-bound as
examined and defined by Bubeck and Cesa-Bianchi in [9]:


Theorem: Pseudo-regret of the $(\alpha, \psi)$-UCB Strategy

Assume that the reward distributions satisfy Equation 2.6. Then $(\alpha, \psi)$-UCB with
$\alpha > 2$ satisfies

$$R_n \leq \sum_{i:\, \Delta_i > 0} \left(\frac{\alpha \Delta_i}{\psi^*(\Delta_i / 2)} \ln(n) + \frac{\alpha}{\alpha - 2}\right)$$

Table 2.3. Theorem: UCB Pseudo-Regret. Reproduced from Bubeck
and Cesa-Bianchi’s Regret Analysis of Stochastic and Nonstochastic Multi-armed
Bandit Problems [9]


The key intuition from this theorem is that each sub-optimal arm is sampled on the order of
$\ln n$ times over $n$ time steps, while the optimal arm is sampled on the order of $n - \ln n$
times. The inflation term, therefore, ensures that every arm continues to be sampled,
preventing the algorithm from getting “stuck” on a sub-optimal arm that just happened to have
a long run of good rewards. It is well known that eliminating the inflation term from the index
leads to a regret that grows linearly, as opposed to logarithmically.
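To illustrate the index policy described above, the following is a minimal Python sketch for Bernoulli rewards in [0, 1]. It assumes the choice $\psi(\lambda) = \lambda^2/8$ (Hoeffding's lemma for rewards in [0, 1]), for which $(\psi^*)^{-1}(x) = \sqrt{x/2}$, so the inflation term becomes $\sqrt{\alpha \ln t / (2 T_i(t-1))}$. The function name, the simulated arm means, the value $\alpha = 2.5$, and the rule of pulling each arm once at the start are assumptions of this example rather than details from [9].

```python
import math
import random


def alpha_ucb(true_means, horizon, alpha=2.5, seed=0):
    """Simulate an alpha-UCB index policy on Bernoulli arms.

    true_means is used only to simulate rewards. Returns the number of
    pulls of each arm after `horizon` time steps.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    pulls = [0] * n_arms          # T_i(t - 1)
    reward_sums = [0.0] * n_arms  # running sum of rewards per arm
    for t in range(1, horizon + 1):
        if t <= n_arms:
            # Pull every arm once so that each index is well defined.
            arm = t - 1
        else:
            # Index = sample mean + inflation term sqrt(alpha * ln t / (2 * T_i)).
            indices = [reward_sums[i] / pulls[i]
                       + math.sqrt(alpha * math.log(t) / (2 * pulls[i]))
                       for i in range(n_arms)]
            arm = max(range(n_arms), key=lambda i: indices[i])
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        pulls[arm] += 1
        reward_sums[arm] += reward
    return pulls
```

In a run such as `alpha_ucb([0.2, 0.5, 0.8], 5000)`, the sub-optimal arms should each receive only a small share of the pulls, in line with the logarithmic sampling rate discussed above.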
2.3 Related Studies

The following are seven studies that apply similar Machine Learning approaches to
intelligence collection operations and are therefore worth mentioning, even though they
focus on slightly different applications.

1. Costica [10] examines methods for reducing the congestion generated during one of the
most time-consuming stages of the intelligence cycle, namely the classification stage.
He proposes a tandem-queue-based optimization model as an analytic solution to this
problem.

2. Nevo [11] builds upon Costica’s work by further studying this intelligence cycle
bottleneck. He formulates the problem as an exploration-exploitation trade-off between
known good intelligence sources and raw or unexplored sources that may or may not be
valuable.

3. Ellis [12] develops a software library implementing Nevo’s mathematical model of
information selection in an Online Learning setting, specifically, the intelligence cycle
mentioned above. Further, he tests the performance of the implemented algorithms in a
social communications network setting.

4. Tekin [13] analyzes applying Online Learning methods to a couple of the early
stages of the intelligence cycle, namely, the collecting and processing stages. He
assumes that the intelligence products arrive sequentially, so that Online Learning
algorithms are a realistic approach. Specifically, he develops a modified Thompson
Sampling algorithm to select the optimal arm given the most recent samples analyzed.

5. Marshall [14] approaches the collection, processing, and analyzing stages of the
intelligence cycle from a MAB Allocation (MABA) framework. Specifically, this
framework models the problem as a “novel finite horizon Bayesian stochastic dynamic
programming problem...” [14]. Further, he utilizes a novel Lagrangian-based
index heuristic for source, or arm, selection.

6. Hepworth [15] investigates leveraging quantile, or superquantile, risk under a loss
constraint in a MAB setting. Specifically, he applies his algorithms in an intelligence
collection setting where each arm of the MAB corresponds to a particular item or
document, which may yield significant or little to no value to the intelligence analyst.
He develops two sequential elimination algorithms which “select the most important
source for a given constraint level, sampling from the arm(s) with the largest conditional
expectation over a quantile” [15].

7. Grant [16] attacks a significantly different problem than those listed above but
utilizes similar methods. He focuses on the UAV Search Problem, specifically,
allocating UAVs to various sub-regions or boundaries in order to optimize detection
of events of interest. This problem is, of course, of great interest to the intelligence
community at large. He approaches this problem along three broad avenues: Intensity
Estimation, Optimization, and Machine Learning.
CHAPTER 3:
Methodology


In this chapter, we examine the basic problem and our approach to its solution. As mentioned
in Chapter 1, we consider a scenario where a searcher attempts to locate and maintain
observation of a target. We model the target’s behavior as a discrete time Markov Chain
whose state space is the target’s location, activity, or any specific attributes that change
over time or in some sequential manner. The searcher has only one sensor with which to
observe/follow the entity of interest, receiving a reward of one at each time step in which
the sensor detects the current state of the target. The searcher’s decision variable is the
sensor’s location (i.e., a state of the Markov chain) at each time step. The objective
of the searcher is to allocate the sensor dynamically so as to earn the largest expected
total reward over some finite time horizon $N$, with the current discrete time step being
$n \in \mathcal{N} = \{1, 2, \ldots, N\}$. There are four basic settings for this problem, which are listed
below. We delve briefly into the first two since they are the simplest settings but focus the
majority of our effort on the last two as they are the most insightful.

• Known Transition Dynamics with an Oracle

• Known Transition Dynamics without an Oracle

• Unknown Transition Dynamics with an Oracle (Naive and Single-Miss methods)

• Unknown Transition Dynamics without an Oracle (hardest and most interesting)

For the two settings without an oracle, we assume that our Markov Chain is
irreducible (i.e., the target will never enter a “sink” state or “sink” subset of states). This
ensures that it is possible for us to regain observation of the target once we lose track of
it. Hence, if we (the searcher) keep looking at, say, state 1, then we will eventually regain
observation of the target. We define our state space as $\mathcal{X} = \{1, 2, \ldots, L\}$.
Further, we use the following notation to indicate states within the state space: $x, \ell, i, j \in \mathcal{X}$.
Of note, Appendix A summarizes and lists the notation used within Chapters 3 and 4.
3.1 Known Transition Dynamics with an Oracle (KTO)

In this first and very basic setting, Known Transition Dynamics with an Oracle (KTO), we
assume that after each time step the Oracle gives the searcher the true movement/location
of the target. For example, if, at $n = 5$, the searcher looks in state $i$ but the
target actually transitioned to state $j$, the searcher is given that information before moving to
the next decision/time step, $n = 6$. This enables us to easily resolve the rewards and choose
where to send the sensor for the next time step. Further, we also assume that the underlying
transition probabilities are known, making this setting primarily an exploitation problem
and removing the standard exploration-exploitation tension. In this setting the searcher
always places the sensor on the mode of the transition matrix row for the target’s current
state, that is, the state with the highest transition probability.


3.2  Known  Transition  Dynamics  with  No  Oracle  (KTNO) 

In this section, since the searcher knows the true transition probabilities, we simply condition
on the last known location of the target and the subsequent sequence of misses to determine
the most likely state to which to allocate the sensor. In essence, we multiply together
sub-matrices of $P$ (raising a sub-matrix to a power when the same state is searched repeatedly).

More precisely, suppose the target was last seen in location $x_1$ in period 1, and the sensor
then misses the target in states $x_2, x_3, \ldots, x_{n-1}$. The question for the searcher is: where to
place the sensor in period $n$ given the last known location and the sequence of misses (i.e.,
the sample path $X_1 = x_1, X_2 \neq x_2, \ldots, X_{n-1} \neq x_{n-1}$)? For $x \in \mathcal{X}$ we consider

$$\Pr(X_n = x \mid X_{n-1} \neq x_{n-1}, \ldots, X_2 \neq x_2, X_1 = x_1) = \frac{\Pr(X_n = x, X_{n-1} \neq x_{n-1}, \ldots, X_2 \neq x_2, X_1 = x_1)}{\Pr(X_{n-1} \neq x_{n-1}, \ldots, X_2 \neq x_2, X_1 = x_1)}$$

The goal is to find the most likely state in period $n$, so it suffices to find the maximizer of
the numerator above. That is, we want to maximize

$$\Pr(X_n = x, X_{n-1} \neq x_{n-1}, \ldots, X_2 \neq x_2, X_1 = x_1) = P_{x_1, \bar{x}_2} \cdot P_{\bar{x}_2, \bar{x}_3} \cdots P_{\bar{x}_{n-1}, x}$$

over all $x \in \mathcal{X}$, where $P_{x_1, \bar{x}_2}$ is row $x_1$ of the transition matrix without the element
$P_{x_1, x_2}$, $P_{\bar{x}_2, \bar{x}_3}$ is the sub-matrix formed by removing row $x_2$ and column $x_3$, and
$P_{\bar{x}_{n-1}, x}$ is the $x$’th column of $P$ without the $x_{n-1}$ entry.
As an example, consider a transition probability matrix over the states {1, 2, 3}, with rows
and columns indexed by state:

$$P = \begin{bmatrix} 0.1 & 0.2 & 0.7 \\ 0.5 & 0.2 & 0.3 \\ 0.8 & 0.1 & 0.1 \end{bmatrix}$$

Suppose that $x_1 = 1$. Since the mode of the first row is the third entry, the searcher places
the sensor in state 3 in period 2. If the target is not found there, then $X_2 \neq 3$. Using the
formula above leads to

$$\Pr(X_3 = 1, X_2 \neq 3, X_1 = 1) = [0.1, 0.2] \times \begin{bmatrix} 0.1 \\ 0.5 \end{bmatrix} = 0.11$$

$$\Pr(X_3 = 2, X_2 \neq 3, X_1 = 1) = [0.1, 0.2] \times \begin{bmatrix} 0.2 \\ 0.2 \end{bmatrix} = 0.06$$

and

$$\Pr(X_3 = 3, X_2 \neq 3, X_1 = 1) = [0.1, 0.2] \times \begin{bmatrix} 0.7 \\ 0.3 \end{bmatrix} = 0.13$$

so the searcher is better off putting the sensor in state 3. As before, if the target is found
then we set $X_3 = 3$; otherwise the searcher selects the state corresponding to the largest of

$$\Pr(X_4 = 1, X_3 \neq 3, X_2 \neq 3, X_1 = 1) = [0.1, 0.2] \times \begin{bmatrix} 0.1 & 0.2 \\ 0.5 & 0.2 \end{bmatrix} \times \begin{bmatrix} 0.1 \\ 0.5 \end{bmatrix} = 0.041$$

$$\Pr(X_4 = 2, X_3 \neq 3, X_2 \neq 3, X_1 = 1) = [0.1, 0.2] \times \begin{bmatrix} 0.1 & 0.2 \\ 0.5 & 0.2 \end{bmatrix} \times \begin{bmatrix} 0.2 \\ 0.2 \end{bmatrix} = 0.034$$

and

$$\Pr(X_4 = 3, X_3 \neq 3, X_2 \neq 3, X_1 = 1) = [0.1, 0.2] \times \begin{bmatrix} 0.1 & 0.2 \\ 0.5 & 0.2 \end{bmatrix} \times \begin{bmatrix} 0.7 \\ 0.3 \end{bmatrix} = 0.095$$
Thus, as before, the searcher is better off placing the sensor in state 3 for period 4. The
search proceeds along these lines for as long as the target is not found. As it turns out, it
is optimal for the searcher to forever place the sensor in state 3 as long as the target is not
found, since

$$\Pr(X_n = 3, X_{n-1} \neq 3, \ldots, X_3 \neq 3, X_2 \neq 3, X_1 = 1) = [0.1, 0.2] \times \begin{bmatrix} 0.1 & 0.2 \\ 0.5 & 0.2 \end{bmatrix}^{n-3} \times \begin{bmatrix} 0.7 \\ 0.3 \end{bmatrix}$$

is larger than the corresponding computation for states 1 and 2, for any $n > 3$. Since
the matrix [0.1, 0.2; 0.5, 0.2] is sub-stochastic, it can be seen that
$\Pr(X_n \neq 3, X_{n-1} \neq 3, \ldots, X_3 \neq 3, X_2 \neq 3, X_1 = 1) \to 0$, so the target must eventually
be found by placing the sensor in state 3. Indeed, leaving the sensor static in (any) single
state guarantees that the target is eventually found, as long as the Markov chain is
irreducible. Once the target is found in state 3, the searcher is then better off placing the
sensor in state 1, and so on.
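The computations in this example can be reproduced with a short Python sketch. Instead of forming the sub-matrices explicitly, it propagates the probability mass one period at a time and zeroes out the state that was searched and missed, which gives the same products. The function names and the 0-based state indexing are assumptions of this illustration.

```python
def miss_path_scores(P, start, misses):
    """Joint probabilities Pr(X_n = x, all misses, X_1 = start) for each state x.

    P is a row-stochastic matrix given as a list of lists, start is the last
    known state, and misses lists the states searched without success in
    periods 2, ..., n-1. All states are 0-based indices.
    """
    n_states = len(P)
    # Probability mass over the current state, jointly with the misses so far.
    mass = [1.0 if s == start else 0.0 for s in range(n_states)]
    for missed in misses:
        # Advance one period, then remove the state that was searched and missed.
        mass = [sum(mass[a] * P[a][b] for a in range(n_states))
                for b in range(n_states)]
        mass[missed] = 0.0
    # One more transition gives the joint probability for period n.
    return [sum(mass[a] * P[a][b] for a in range(n_states))
            for b in range(n_states)]


def best_state(P, start, misses):
    """State (0-based) that maximizes the joint probability above."""
    scores = miss_path_scores(P, start, misses)
    return max(range(len(scores)), key=lambda s: scores[s])


# Transition matrix from the example; state k in the text is index k - 1 here.
P = [[0.1, 0.2, 0.7],
     [0.5, 0.2, 0.3],
     [0.8, 0.1, 0.1]]
```

With the target last seen in state 1 (index 0) and one or two misses in state 3 (index 2), `miss_path_scores` gives the values 0.11, 0.06, 0.13 and 0.041, 0.034, 0.095 up to floating-point rounding, and `best_state` selects state 3 in both cases.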

These ideas can be extended to the case where the searcher has $w$ sensors, $1 \leq w \leq L$;
when $w = L$ the target is always found. The idea is to put the sensors in the most likely
states conditioned on the initial known state and the sequence of misses. Specifically, for
a sample path $X_1 = x_1, X_2 \neq x_2, \ldots, X_{n-1} \neq x_{n-1}$, with $x_2, \ldots, x_{n-1} \in \mathcal{X}^w$, the most
likely states in period $n$ are found by finding the $w$ maximizers of

$$\Pr(X_n = x, X_{n-1} \neq x_{n-1}, \ldots, X_2 \neq x_2, X_1 = x_1) = P_{x_1, \bar{x}_2} \cdot P_{\bar{x}_2, \bar{x}_3} \cdots P_{\bar{x}_{n-1}, x}$$

over all $x \in \mathcal{X}$, where $P_{x_1, \bar{x}_2}$ is row $x_1$ of the transition matrix without the elements
$P_{x_1, x_2}$, $P_{\bar{x}_2, \bar{x}_3}$ is the sub-matrix formed by removing rows $x_2$ and columns $x_3$, and
$P_{\bar{x}_{n-1}, x}$ is the $x$’th column of $P$ without the $x_{n-1}$ entries.
3.3 Unknown Transition Dynamics with Oracle (UTO)

In this third basic setting, Unknown Transition Dynamics with Oracle (UTO), we assume
that after each time step, the searcher is given the true movement/location of the target,
again from an Oracle, but this time the searcher has to learn or estimate the true underlying
transition probabilities. For example, if, at $n = 5$, the searcher looks in state $i$ but the target
actually transitioned to state $j$, the searcher is given that information before moving to the
next decision/time step, $n = 6$. This enables a quick resolution of the rewards and easy
calculation of the estimated transition probabilities for the next time step. We explore this
setting in two ways. The first is more intuitive and a better method for this setting but is
fragile, or difficult to extend to our final setting. Therefore, we develop the second approach
with a few different variants. As a final note, the developments in this section also apply
when the searcher has multiple sensors.


Case  1:  Full  dependence  on  Constant  Oracle 

We examine a Markov chain with an unknown transition probability matrix, i.e., unknown
transition dynamics, but with an Oracle that provides the target’s location at the beginning
of each time step (therefore we are only attempting to predict a single-step transition). In
this case, we do not need to utilize a MAB approach. Instead, thanks to the Oracle, we can
update the elements of the empirical transition matrix at the end of each time step. Hence,
the optimal action for the searcher is to place the sensor in the state that is the empirical
mode out of the last known state.
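The Case 1 policy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and method names are hypothetical, states are 0-based indices, and ties (including the case of a row with no observations yet) are broken in favor of the lowest-numbered state, a choice the text does not specify.

```python
class EmpiricalModePolicy:
    """Track observed transitions and search the empirical mode of a row."""

    def __init__(self, n_states):
        # counts[i][j] = number of observed i -> j transitions so far.
        self.counts = [[0] * n_states for _ in range(n_states)]

    def update(self, prev_state, next_state):
        """Record the transition revealed by the Oracle."""
        self.counts[prev_state][next_state] += 1

    def choose(self, last_state):
        """Return the most frequently observed successor of last_state."""
        row = self.counts[last_state]
        return max(range(len(row)), key=lambda j: row[j])

    def estimated_row(self, state):
        """Empirical transition probabilities out of state, or None if unseen."""
        total = sum(self.counts[state])
        if total == 0:
            return None
        return [c / total for c in self.counts[state]]
```

After each time step the searcher would call `update` with the transition disclosed by the Oracle and then call `choose` with the target's current state to place the sensor.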

Case  2:  Naively  Using  the  Oracle  for  Reset 

While the above setting is intuitive and works quite well, what if the searcher is unable
to receive the Oracle’s information at every time step and instead gets a periodic update (say
every 6 or 12 time steps)? In the operational setting this could correspond to the target
(terrorist) going home every night or a small fishing boat returning to harbour in the evening.
In both of these cases, if a time step is defined as a single hour, the Oracle would provide its
periodic update roughly every 12 hours. This motivates the following versions of the UTO
setting.
In this approach we estimate the transition probabilities using only the hits’ information.
Specifically, for each state-pair $(i, j) \in \mathcal{X}^2$ we compute the ratio of the number of hits in
state $j$ when the target was just seen in state $i$ (the state $i \to j$ transition) up to time $n$,
$h_{i,j}^{(n)}$, over the total number of sensor placements in (or views of) state $j$ when the target
was last in state $i$ up to time $n$, $v_{i,j}^{(n)}$. Specifically, our initial estimator of
$p_{ij} = P(X_{t+1} = j \mid X_t = i)$ is

$$q_{i,j}^{(n)} = \frac{h_{i,j}^{(n)}}{v_{i,j}^{(n)}}, \qquad (3.1)$$

when $v_{i,j}^{(n)} > 0$. Of note, $q_{i,j}^{(n)}$ is the $i$’th, $j$’th element of $Q^{(n)}$, the estimated transition
probability matrix at time $n$. Since this estimator does not use miss information, the
$\sum_j q_{i,j}^{(n)}$ may be strictly smaller than 1. Rescaling each row to sum to 1 produces the
following estimator:

$$\hat{q}_{i,j}^{(n)} = \frac{h_{i,j}^{(n)}}{v_{i,j}^{(n)}} \left(\sum_{\ell \in \mathcal{X}} \frac{h_{i,\ell}^{(n)}}{v_{i,\ell}^{(n)}}\right)^{-1} \qquad (3.2)$$


As an example of these ideas imagine that, prior to rescaling, we have the below row of
probability element estimates for a four state Markov chain:

$$\frac{h_{1,\ell}^{(n)}}{v_{1,\ell}^{(n)}} = \left[ \tfrac{1}{28} \;\; \tfrac{2}{16} \;\; \tfrac{5}{27} \;\; \tfrac{12}{41} \right],$$

and recall that in this case the oracle discloses the last position of the target. This means
that we can execute the below updates at each time step (hence, using the oracle only for
reset). Let's say we choose to allocate the sensor to state 4. If the target transitions from its
current state, 1, to state 2 we would miss it and update the above row of fractions as follows:

$$\frac{h_{1,\ell}^{(n+1)}}{v_{1,\ell}^{(n+1)}} = \left[ \tfrac{1}{28} \;\; \tfrac{2}{16} \;\; \tfrac{5}{27} \;\; \tfrac{12}{42} \right].$$

If, on the other hand, the target transitions from its current state, 1, to state 4 we would
receive a hit and update the above row of fractions as follows:

$$\frac{h_{1,\ell}^{(n+1)}}{v_{1,\ell}^{(n+1)}} = \left[ \tfrac{1}{28} \;\; \tfrac{2}{16} \;\; \tfrac{5}{27} \;\; \tfrac{13}{42} \right].$$

From this last row of fractions we would update the 1st row of our transition probability
estimates, with the rescaled probabilities as follows:

$$q_{1,\ell}^{(n+1)} = \left[ \tfrac{1}{28} \;\; \tfrac{2}{16} \;\; \tfrac{5}{27} \;\; \tfrac{13}{42} \right] \left( \tfrac{1}{28} + \tfrac{2}{16} + \tfrac{5}{27} + \tfrac{13}{42} \right)^{-1}.$$
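The counter updates in this example can be sketched in a few lines of Python. This is only an illustrative sketch, not the thesis implementation: the function names are hypothetical, states are indexed from 0, and returning 0 for a state that has never been viewed is an assumption (Equation 3.1 is only defined when the view count is positive).

```python
from fractions import Fraction

def update_counters(hits, views, sensor_state, target_state):
    """Update one row of hit/view counters after a single sensor placement.

    hits[j] and views[j] count the i -> j observations for the current
    source state i (hypothetical helper, 0-based state indices).
    """
    views[sensor_state] += 1            # the sensor looked at this state
    if sensor_state == target_state:    # hit: the target was there
        hits[sensor_state] += 1
    return hits, views

def raw_estimates(hits, views):
    """Equation 3.1: hits over views (assumed 0 when a state was never viewed)."""
    return [Fraction(h, v) if v > 0 else Fraction(0) for h, v in zip(hits, views)]

def rescaled_estimates(hits, views):
    """Equation 3.2: divide each raw ratio by the row sum so the row sums to 1."""
    raw = raw_estimates(hits, views)
    total = sum(raw)
    return [r / total for r in raw] if total > 0 else raw

# Row 1 of the worked example: [1/28, 2/16, 5/27, 12/41].
hits, views = [1, 2, 5, 12], [28, 16, 27, 41]

# Sensor sent to state 4 (index 3), target moved to state 2 (index 1): a miss.
miss_hits, miss_views = update_counters(hits[:], views[:], sensor_state=3, target_state=1)

# Sensor sent to state 4 and the target moved to state 4: a hit.
hit_hits, hit_views = update_counters(hits[:], views[:], sensor_state=3, target_state=3)

# Rescaled first row after the hit.
row = rescaled_estimates(hit_hits, hit_views)
```

Exact fractions are used here only so that the updated rows can be compared directly with the fractions in the example; floating point would serve equally well in a simulation.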


If the searcher can cover the entire state space in each iteration with sensors, then there are
no misses, and the estimator defined above is the classical MLE obtained as the solution of

$$\max_{q_{i,1}, \dots, q_{i,L}} \; \prod_{j} q_{i,j}^{h_{i,j}^{(n)}}$$

subject to

$$\sum_{j} h_{i,j}^{(n)} = \#\,\text{transitions out of state } i \text{ by time } n$$

and

$$\sum_{j} q_{i,j} = 1, \quad \text{with } q_{i,j} \ge 0,$$

for each $i \in \mathcal{L}$, when the number of transitions out of state $i$ is positive and also where
$q_{i,j}^{h_{i,j}^{(n)}}$ is the probability $q_{i,j}$ raised to the $h_{i,j}^{(n)}$th power. Of course, when $L = 2$ (i.e., there are just
two states), a miss gives as much information as a hit; the same is true when the searcher
can cover $L - 1$ states with sensors. However, it is not clear how this generalizes to larger
state spaces, even when just two states are not searched in a period (e.g., one sensor and
$L = 3$).


Case  3:  Using  the  Oracle  only  for  Reset 

While  maximizing  the  likelihood  of  the  hits’  observations  is  good,  we  aren’t  distributing 
the  density  of  the  target’s  movement  to  the  numerator  of  our  transition  counters  when  the 
sensor  returns  a  miss  or  zero.  Remember  the  example  from  Case  2  that  only  updated  the 
denominator  for  the  miss.  Intuitively,  we  know  that  a  miss  means  the  target  transitioned  to 
one  of  the  other  (unobserved)  states  but  we  do  not  update  any  of  their  associated  fractions. 
This  means  that  we  are  not  using  all  of  the  information  we  could  glean  from  each  time  step. 
To rectify this situation we maximize the likelihood of both hit and miss observations.

For ease of reading we drop the time step notation on all variables since this calculation
is executed during a single time step. Since we are only looking at the single-step misses,
the maximum likelihood optimization problem is separable into $L$ optimization problems
(below), one for each row of the transition matrix. The resultant likelihood function,
capturing all of the hits and single-step misses so far, that we maximize is (abusing notation):

$$L(q_{i,1}, \dots, q_{i,L}) = \prod_{j=1}^{L} q_{i,j}^{h_{i,j}} \, (1 - q_{i,j})^{v_{i,j} - h_{i,j}}.$$

For each $i \in \mathcal{L}$ with $v_{i,j} > 0$, we solve

$$\max_{q_{i,1}, \dots, q_{i,L}} L(q_{i,1}, \dots, q_{i,L}),$$

with the following constraints:

$$\sum_{j=1}^{L} q_{i,j} = 1,$$

$$q_{i,j} \ge 0, \quad \forall j \in \mathcal{L}.$$

Setting $q^*_{i,j} = 0$ is optimal when $h_{i,j} = 0$, so henceforth we assume that $0 < h_{i,j} < v_{i,j}$. The
resultant log-likelihood or objective function is

$$\max_{q_{i,1}, \dots, q_{i,L}} \log L(q_{i,1}, \dots, q_{i,L}) = \sum_{j=1}^{L} \left[ h_{i,j} \log(q_{i,j}) + (v_{i,j} - h_{i,j}) \log(1 - q_{i,j}) \right].$$

The objective function is a sum of concave functions, and therefore is concave. The constraint
set, being a simplex, is convex. Hence, the Karush-Kuhn-Tucker (KKT) conditions,
as listed by Dimitri Bertsekas in [17] and developed by William Karush in [18] and Harold
Kuhn and Albert Tucker in [19], are sufficient for optimality. These are

$$\frac{h_{i,j}}{q_{i,j}} - \frac{v_{i,j} - h_{i,j}}{1 - q_{i,j}} = k + \lambda_j, \quad \forall j \in \mathcal{L},$$

where $k$ is the Lagrange multiplier for the sum constraint, and $\lambda_j \ge 0$ is the Lagrange
multiplier for the non-negativity constraint, $q_{i,j} \ge 0$. We have $\lambda_j = 0$ if $q_{i,j} > 0$, when
$h_{i,j} > 0$ (true by assumption), so we can disregard these multipliers in the analysis below.

The partial derivatives are monotone decreasing in $q_{i,j}$, with

$$\frac{h_{i,j}}{q_{i,j}} - \frac{v_{i,j} - h_{i,j}}{1 - q_{i,j}} \to \infty \quad \text{as } q_{i,j} \to 0$$

and

$$\frac{h_{i,j}}{q_{i,j}} - \frac{v_{i,j} - h_{i,j}}{1 - q_{i,j}} \to -\infty \quad \text{as } q_{i,j} \to 1.$$

Therefore, all $q_{i,j} > 0$ at optimality since the Markov chain is assumed to be irreducible.
Also,

$$\frac{h_{i,j}}{q_{i,j}} - \frac{v_{i,j} - h_{i,j}}{1 - q_{i,j}} = 0 \iff q_{i,j} = \frac{h_{i,j}}{v_{i,j}},$$

so that $q_{i,j} < \frac{h_{i,j}}{v_{i,j}}$ if and only if $k > 0$. Summing over all the $q_{i,j}$'s for this row we get

$$k > 0 \iff \sum_{j=1}^{L} \frac{h_{i,j}}{v_{i,j}} > 1.$$

Based on the above results, there are only three possibilities, dependent on the value of the
sum of our transition counters for this row, $\sum_{j=1}^{L} \frac{h_{i,j}}{v_{i,j}}$. These three possibilities or cases are
enumerated as follows:

1. $\sum_{j=1}^{L} \frac{h_{i,j}}{v_{i,j}} = 1$, in which case the optimal solution is $q^*_{i,j} = \frac{h_{i,j}}{v_{i,j}}$.

2. $\sum_{j=1}^{L} \frac{h_{i,j}}{v_{i,j}} > 1$, so the optimal solution satisfies the root equation below for $k > 0$.

Solving for $q_{i,j}$ we get,

$$\frac{h_{i,j}}{q_{i,j}} - \frac{v_{i,j} - h_{i,j}}{1 - q_{i,j}} = k \iff h_{i,j} - v_{i,j} q_{i,j} = k q_{i,j} - k q_{i,j}^2 \iff k q_{i,j}^2 - (v_{i,j} + k) q_{i,j} + h_{i,j} = 0,$$

for all $j \in \mathcal{L}$. Hence,

$$q_{i,j} = \frac{v_{i,j} + k \pm \sqrt{(v_{i,j} + k)^2 - 4 k h_{i,j}}}{2k},$$

with $k > 0$. The radical is positive, since $v_{i,j} > h_{i,j} > 0$ leads to

$$(v_{i,j} + k)^2 - 4 k h_{i,j} > h_{i,j}^2 + 2 h_{i,j} k + k^2 - 4 k h_{i,j} = (h_{i,j} - k)^2 \ge 0.$$

Taking the positive root first, $q^{(+)}_{i,j}$, we get:

$$q^{(+)}_{i,j} = \frac{v_{i,j} + k + \sqrt{(v_{i,j} + k)^2 - 4 k h_{i,j}}}{2k} > \frac{1}{2},$$

which is not necessarily true. Therefore, we only deal with the negative root, $q^{(-)}_{i,j}$.

This boundary condition, $\sum_{j=1}^{L} q_{i,j} = 1$, leads to the root equation in $k$ given by

$$(L - 2) k + \sum_{j=1}^{L} \left( v_{i,j} - \sqrt{(v_{i,j} + k)^2 - 4 k h_{i,j}} \right) = 0,$$

with unique solution $k^*$, by construction. In conclusion, the MLE's for the transition
probabilities are,

$$q^*_{i,j} = \frac{v_{i,j} + k^* - \sqrt{(v_{i,j} + k^*)^2 - 4 k^* h_{i,j}}}{2 k^*}.$$


3. In case $\sum_{j=1}^{L} \frac{h_{i,j}}{v_{i,j}} < 1$, with $k^* < 0$ in the root equation, the MLE is the same,

$$q^*_{i,j} = \frac{v_{i,j} + k^* - \sqrt{(v_{i,j} + k^*)^2 - 4 k^* h_{i,j}}}{2 k^*},$$

but with $k^* < 0$. The intuition is that the empirical transition probabilities $\frac{h_{i,j}}{v_{i,j}}$,
obtained by only counting the hits, are inflated to maximize the likelihood of hits and
misses.

In conclusion, this MLE approach, while making use of all the observations, has two major
limitations. Computationally, it requires solving a root problem after each new observation
and operationally, it presumes the searcher receives a target location update each time period
from the oracle. The former can be ameliorated by warm starting the root finding algorithm
with the last solution and the latter leads to the next scenario.
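The three cases above can be handled by one small root finder. The following Python sketch is an illustration under stated assumptions, not the thesis code: the function names and the `k_start` warm-start parameter are hypothetical, and bisection is simply one convenient choice of root-finding method. Because $k = 0$ always satisfies the root equation in the form multiplied through by $2k$, the sketch instead solves $\sum_j q_{i,j}(k) = 1$ directly, using a rationalized form of the negative root that is also well defined at $k = 0$.

```python
import math

def q_of_k(h, v, k):
    """Negative root of k*q^2 - (v + k)*q + h = 0, in rationalized form.

    Algebraically equal to (v + k - sqrt((v + k)^2 - 4*k*h)) / (2*k), but
    also defined at k = 0, where it reduces to h / v.
    """
    if h == 0:
        return 0.0
    return 2.0 * h / (v + k + math.sqrt((v + k) ** 2 - 4.0 * k * h))

def single_miss_mle(hits, views, k_start=0.0, tol=1e-12, max_iter=200):
    """Return (k_star, row) with sum(row) = 1, found by bisection on k.

    Assumes 0 <= hits[j] <= views[j], views[j] > 0, and at least two states
    with hits[j] > 0 so that the row sum can reach 1 on the k < 0 side.
    """
    def excess(k):
        return sum(q_of_k(h, v, k) for h, v in zip(hits, views)) - 1.0

    # The row sum is decreasing in k, so widen [lo, hi] until it brackets the root.
    lo, hi, step = k_start, k_start, 1.0
    for _ in range(max_iter):
        if excess(lo) >= 0.0 and excess(hi) <= 0.0:
            break
        if excess(lo) < 0.0:
            lo -= step
        if excess(hi) > 0.0:
            hi += step
        step *= 2.0
    else:
        raise ValueError("could not bracket k*")

    # Plain bisection on the bracket.
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    k_star = 0.5 * (lo + hi)
    return k_star, [q_of_k(h, v, k_star) for h, v in zip(hits, views)]

# Counters [1/28, 2/16, 5/27, 13/42]: the hit ratios sum to less than 1,
# so k* < 0 and every estimate is inflated relative to h / v.
k_star, row = single_miss_mle([1, 2, 5, 13], [28, 16, 27, 42])
```

Passing the previous solution as `k_start` is one way to realize the warm start mentioned above, since the bracket is grown outward from that value.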


Case  4:  No  Oracle  Present 

At  the  other  extreme  of  the  gamut,  the  searcher  may  never  receive  feedback  from  an  oracle, 
making  a  MLE,  including  both  hits  and  misses,  substantially  more  difficult.  The  issue  in 
this  setting  is  that  rather  than  having  an  optimization  problem  for  each  row  of  the  transition 
matrix,  due  to  the  separability  structure,  there  is  a  single  optimization  problem  involving  all 
the  transition  probabilities.  In  this  subsection  we  sketch  the  ideas  for  a  solution  approach. 

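The starting point is a sample path of hits and misses. Namely, let $\tau_1 = 1$, and $\tau_2, \tau_3, \dots, \tau_{N_n}$
be the (random) times where the target is located, where $1 \le N_n \le n$ is the (random)
number of times the target is found by time $n$. Then a sample path between the first
two hits is $X_1 = x_1, X_2 \ne x_2, \dots, X_{\tau_2 - 1} \ne x_{\tau_2 - 1}, X_{\tau_2} = x_{\tau_2}$; between hits two and three
$X_{\tau_2 + 1} \ne x_{\tau_2 + 1}, X_{\tau_2 + 2} \ne x_{\tau_2 + 2}, \dots, X_{\tau_3 - 1} \ne x_{\tau_3 - 1}, X_{\tau_3} = x_{\tau_3}$, and so on.

Inspired by Section 3.2, we maximize the likelihood by time $\tau_{N_n}$ (for simplicity)

$$L(q_{1,1}, \dots, q_{1,L}; \dots; q_{L,1}, \dots, q_{L,L})$$

$$= P\left( X_{\tau_{N_n}} = x_{\tau_{N_n}}, X_{\tau_{N_n} - 1} \ne x_{\tau_{N_n} - 1}, \dots, X_{\tau_2 + 1} \ne x_{\tau_2 + 1}, X_{\tau_2} = x_{\tau_2}, X_{\tau_2 - 1} \ne x_{\tau_2 - 1}, \dots, X_2 \ne x_2, X_1 = x_1 \right)$$

$$= Q_{x_1, \bar{x}_2} \cdot Q_{\bar{x}_2, \bar{x}_3} \cdots Q_{\bar{x}_{\tau_2 - 1}, x_{\tau_2}} \cdot Q_{x_{\tau_2}, \bar{x}_{\tau_2 + 1}} \cdots Q_{\bar{x}_{\tau_{N_n} - 1}, x_{\tau_{N_n}}}$$

for all states $x \in \mathcal{L}$, where $Q_{x_1, \bar{x}_2}$ is row $x_1$ without the element $Q_{x_1, x_2}$ of the estimated
transition matrix $Q$, $Q_{\bar{x}_2, \bar{x}_3}$ is the sub-matrix formed by removing row $x_2$ and column $x_3$, and
$Q_{\bar{x}_{\tau_{N_n} - 1}, x_{\tau_{N_n}}}$ is the $x_{\tau_{N_n}}$'th column of $Q$ without the $x_{\tau_{N_n} - 1}$ entry. The optimization problem
leading to the ML estimator is

$$\max_{q_{i,j},\, (i,j) \in \mathcal{L}^2} \log L(q_{1,1}, \dots, q_{1,L}; \dots; q_{L,1}, \dots, q_{L,L}),$$

for each state pair $(i,j)$ with $v_{i,j} > 0$, with the following constraints:

$$\sum_{j=1}^{L} q_{i,j} = 1, \quad \forall i \in \mathcal{L},$$

$$q_{i,j} \ge 0, \quad \forall i, j \in \mathcal{L}.$$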

The  constraint  set,  being  the  intersection  of  L  simplexes,  is  convex.  The  Hessian  of  the 
objective  function  can  be  shown  to  be  negative  definite,  so  that  the  objective  function  is 
concave.  Hence,  as  in  Case  3  above,  the  KKT  conditions  are  sufficient  for  optimality. 
Unfortunately,  the  resulting  root  equations  can’t  be  solved  analytically  in  this  case.  We 
believe  that  an  online  algorithm,  such  as  online  gradient  descent,  would  be  useful  in  this 
setting,  but  leave  as  future  work  the  problem  of  developing  a  more  insightful  approach. 
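To make the hit-and-miss likelihood of this case concrete, the sketch below evaluates it for a given candidate matrix with a forward recursion. It is an illustrative sketch only: the function name and the `(sensor_state, hit)` record format are assumptions, and zeroing an entry of the running vector plays the role of removing the corresponding row or column in the product of sub-matrices. It evaluates the likelihood; it does not maximize it.

```python
def path_likelihood(Q, start_state, observations):
    """Likelihood of a hit/miss record under the transition matrix Q.

    Q is a list of rows, start_state is the known initial state (0-based),
    and observations is a list of (sensor_state, hit) pairs, one per step.
    """
    n_states = len(Q)
    # alpha[s] = P(observations so far, target currently in state s)
    alpha = [0.0] * n_states
    alpha[start_state] = 1.0
    for sensor_state, hit in observations:
        # One-step propagation: alpha <- alpha * Q
        alpha = [sum(alpha[i] * Q[i][j] for i in range(n_states))
                 for j in range(n_states)]
        if hit:
            # A hit keeps only the probability mass in the searched state.
            alpha = [a if j == sensor_state else 0.0 for j, a in enumerate(alpha)]
        else:
            # A miss rules out the searched state only.
            alpha[sensor_state] = 0.0
    return sum(alpha)

# Four-state example: start in state 1, miss in state 4, then hit in state 1.
Q = [[0.10, 0.20, 0.30, 0.40],
     [0.90, 0.02, 0.04, 0.04],
     [0.70, 0.03, 0.15, 0.12],
     [0.82, 0.06, 0.02, 0.10]]
value = path_likelihood(Q, 0, [(3, False), (0, True)])
```

An online method such as the gradient approach suggested above would need the gradient of the logarithm of this quantity with respect to the entries of `Q`; that step is not sketched here.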

3.4  Exploration  and  Exploitation 

We answer two primary questions in this section. First, why not use the current estimate
of the highest transition probability, i.e., the mode of a given row of $Q^{(n)}$, to determine
the sensor allocation? Second, how can the searcher incorporate uncertainty in a way that
induces efficient exploration of all the states? We tackle these questions in the context of
Cases 2 and 3 of the preceding section, for a searcher that has only one sensor.

The  main  purpose  of  the  resulting  search  rule  is  to  ensure  that  the  algorithm  finds  or  learns 
the  optimal  arm  or  true  mode  of  a  given  row  of  the  transition  matrix  P,  instead  of  narrowing 
in  on  a  sub-optimal  arm.  A  quick  example  will  help  demonstrate  the  need  or  motivation  for 
this index policy. Consider the following true transition probability vector for state 1,

$$p_{1,\ell} = \left[ \tfrac{1}{10} \;\; \tfrac{2}{10} \;\; \tfrac{3}{10} \;\; \tfrac{4}{10} \right].$$

Now, imagine that after a number of periods, say $n$, the estimate of $p_{1,\ell}$ is

$$\left[ \tfrac{1}{5} \;\; \tfrac{1}{4} \;\; \tfrac{6}{20} \;\; \tfrac{1}{4} \right],$$

where $\tfrac{6}{20}$ should be interpreted as 6 hits in state 3 out of 20 views in state 3 when the target
was just seen (or reported by the oracle) in state 1.

If the searcher always selects the mode of this row to place the sensor then, since according
to our current estimate it is our best option, we will continue to sample from that state,
increasing the precision of this specific transition probability estimate. As $n$ gets larger, the
estimate remains close to 0.3 with high probability, since that is in fact its true value, and we
will therefore never explore the rest of the arms of this row since they appear to be worse.
But we know that this is a sub-optimal arm since in fact $p_{1,4} = 0.4$. This small example
highlights the need for some form of exploration term to force the algorithm to continue
exploring arms that appear to be sub-optimal at the time but that may in fact be optimal, as
in the case above.


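The first step to derive an index that forces exploration is to bound the estimator error
probability, $\Pr(|q_{i,j} - p_{i,j}| > \epsilon)$, for $\epsilon > 0$. We now argue how this can be done, for the
estimator $q_{i,j}$ in Cases 2 and 3 of Section 3.3. To keep the notation simple we omit the
superscript $(n)$.

For $q_{i,j}$ as in Case 3 of Section 3.3, and $k^* > 0$,

$$q_{i,j} - p_{i,j} > \epsilon \iff \frac{v_{i,j} + k^* - \sqrt{(v_{i,j} + k^*)^2 - 4 k^* h_{i,j}}}{2 k^*} > p_{i,j} + \epsilon$$

$$\iff v_{i,j} + k^* - \sqrt{(v_{i,j} + k^*)^2 - 4 k^* h_{i,j}} > 2 k^* (p_{i,j} + \epsilon)$$

$$\iff \left( v_{i,j} + k^* - 2 k^* (p_{i,j} + \epsilon) \right)^2 > (v_{i,j} + k^*)^2 - 4 k^* h_{i,j}$$

$$\iff -4 k^* (p_{i,j} + \epsilon)(v_{i,j} + k^*) + \left( 2 k^* (p_{i,j} + \epsilon) \right)^2 > -4 k^* h_{i,j}$$

$$\iff h_{i,j} - (p_{i,j} + \epsilon)(v_{i,j} + k^*) + k^* (p_{i,j} + \epsilon)^2 > 0.$$

Hence,

$$k^* < \frac{h_{i,j} - v_{i,j} (p_{i,j} + \epsilon)}{(p_{i,j} + \epsilon) - (p_{i,j} + \epsilon)^2},$$

and

$$\Pr(q_{i,j} - p_{i,j} > \epsilon) = \Pr\left( k^* < \frac{h_{i,j} - v_{i,j} (p_{i,j} + \epsilon)}{(p_{i,j} + \epsilon) - (p_{i,j} + \epsilon)^2} \right) \tag{3.3}$$

$$\le \Pr\left( h_{i,j} - v_{i,j} (p_{i,j} + \epsilon) > 0 \right) \le \exp(-2 v_{i,j} \epsilon^2), \tag{3.4}$$

the last inequality by Hoeffding's inequality.

The same proof technique applies for $k^* < 0$, leading to

$$\Pr(q_{i,j} - p_{i,j} < -\epsilon) \le \exp(-2 v_{i,j} \epsilon^2).$$

Regarding the estimator of Case 2 without rescaling (cf. Equation 3.1) in Section 3.3 and
using the same approach we get

$$\Pr(q_{i,j} - p_{i,j} > \epsilon) = \Pr\left( \frac{h_{i,j}}{v_{i,j}} - p_{i,j} > \epsilon \right) = \Pr\left( h_{i,j} - v_{i,j} (p_{i,j} + \epsilon) > 0 \right) \le \exp(-2 v_{i,j} \epsilon^2),$$

and similarly for the deviation in the other direction (as in the case of $k^* < 0$), so we end up with the same bound as Case 3.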
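As a quick numerical illustration of this bound, the sketch below evaluates $\exp(-2 v \epsilon^2)$ for a given number of views and inverts it to obtain the deviation $\epsilon$ that corresponds to a chosen error probability $\delta$. The function names and the symbol $\delta$ are introduced here for illustration only and do not appear in the thesis.

```python
import math

def hoeffding_tail(views, eps):
    """Upper bound exp(-2 * v * eps^2) on the one-sided error probability."""
    return math.exp(-2.0 * views * eps ** 2)

def hoeffding_radius(views, delta):
    """Deviation eps at which the bound equals delta: sqrt(ln(1/delta) / (2 v))."""
    return math.sqrt(math.log(1.0 / delta) / (2.0 * views))

# With 20 views, a deviation of 0.1 is only weakly controlled ...
bound_20 = hoeffding_tail(20, 0.1)    # exp(-0.4)
# ... while 500 views make the same deviation very unlikely.
bound_500 = hoeffding_tail(500, 0.1)  # exp(-10)
```

The radius shrinks like $1/\sqrt{v}$, which is why rarely viewed states keep a wide uncertainty band and remain candidates for exploration.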

From  Equations  3.3  and  3.4  we  see  that  the  MLE  approach  is  less  conservative  than  the 
estimator  obtained  by  only  considering  hits,  which  is  further  supported  by  the  numerical 
work  in  Chapter  4.  A  more  refined  proof  technique  is  needed  to  flesh  out  the  gain  derived 
by  including  the  miss  observations,  but  this  is  left  as  future  work. 

From  here  the  classical  MAB  index  follows, 


for each $i \in \mathcal{L}$. A constant larger than 1.5 leads to more exploration than needed, while the
opposite  is  true  for  a  constant  smaller  than  1.5;  see  [20]  for  a  derivation.  The  intuition  behind 
Equation  3.5  is  that  each  sub-optimal  state  is  sampled  logarithmically  with  the  number  of 
views  from  the  source  state,  while  the  best  state  (i.e.,  the  mode)  is  sampled  the  rest  of  the 
time. 


Given that the target was just observed in state $i$, the MAB algorithm proceeds by placing
the sensor in the state with the largest index,

$$\arg\max_{j = 1, \dots, L} \left[ q_{i,j} + (\text{exploration term of Equation 3.5}) \right].$$
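The selection rule can be sketched as follows. Since the exact expression of the index is not reproduced here, the sketch assumes the widely used upper-confidence-bound form $q_{i,j} + \sqrt{c \ln(n_i) / v_{i,j}}$ with $c = 1.5$, where $n_i$ denotes the total number of views from source state $i$; this form, the function names, and the example counts (other than the 6 hits in 20 views for state 3) are assumptions made for illustration.

```python
import math

def ucb_choice(hits, views, c=1.5):
    """Pick the state with the largest assumed UCB-style index (0-based)."""
    total_views = sum(views)
    best_state, best_index = None, -math.inf
    for j, (h, v) in enumerate(zip(hits, views)):
        if v == 0:
            return j  # view every state at least once before comparing indices
        index = h / v + math.sqrt(c * math.log(total_views) / v)
        if index > best_index:
            best_state, best_index = j, index
    return best_state

def greedy_choice(hits, views):
    """Pick the mode of the current hits-only estimates (no exploration)."""
    est = [h / v if v > 0 else 0.0 for h, v in zip(hits, views)]
    return max(range(len(est)), key=est.__getitem__)

# Hypothetical counters for source state 1: estimates [1/5, 1/4, 6/20, 1/4].
hits, views = [1, 1, 6, 1], [5, 4, 20, 4]
greedy = greedy_choice(hits, views)  # keeps sampling state 3 (index 2)
ucb = ucb_choice(hits, views)        # prefers a rarely viewed state instead
```

With these counts the greedy rule stays on the sub-optimal state 3, whereas the exploration term steers the sensor toward one of the states that has only been viewed four times.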

It is well-known (see Theorem 1 in [20]) that this approach produces an expected regret
with an upper bound that is logarithmic in the number of views.

To complete the picture, for the estimator of Case 2 with rescaling (cf. Equation 3.2) in
Section 3.3 we get


1  ht  . 

h.e\ 

-l 

) 

Pr^i,j~Pi,j  >  =  Pr 

i>j 

i,t 

~ pi, j 

>  e 

) 

-  Pr 

h.  . 

hj 

VU 


> 


1  -  ( Pij  +  e) 


( Pij  +  e )  ’ 


after  working  inside  the  parentheses.  Using  standard  arguments,  the  right-hand  side  above 
becomes 


<  Pr 


h.  . 

hj 


v.  . 

hj 


h. 


ue 


7/ 


<  1 


<  Pr 


h. 


VU 


<  Pu  ~ 


e 


<  exp(-2 vt j62 


This bound can be used to produce an index similar to Equation 3.5, but with higher regret
due to the larger upper bound on the probability of error. However, the numerical results in
the next chapter suggest that the rescaled MLE has smaller regret (i.e., better performance)
than the unscaled MLE, suggesting that the inequalities above are too loose.

As  a  final  note,  while  in  this  section  we  only  considered  a  searcher  with  a  single  sensor, 
the  developments  can  be  extended  to  a  multiple-sensor  setting  following  the  ideas  of  Chen, 
Wang,  and  Yuan  in  their  “Combinatorial  Multi-Armed  Bandit:  General  Framework,  Results 
and  Applications”  [21],  who  study  the  problem  of  how  to  optimally  pull  several  arms 
simultaneously.

CHAPTER 4:

Analysis and Simulation Results


In this chapter, we examine the initial simulation results, validating the analytic bounds
calculated in Chapter 3. Specifically, in a couple of small examples we see eight to ten times
faster convergence rates when compared to the Naive approach. Of note, we will examine
this comparison between the Naive method and our Single-Miss Algorithm throughout this
chapter. Here is a quick outline:

1.  Tiny  System  (4  States) 

2.  Small  System  (10  States) 

To simulate our algorithm's performance against the Naive method, we leverage MATLAB
to simulate the target's behaviour, based on each specific setting's transition probabilities,
and then implement both methods to attempt to learn the target's behaviour pattern or
transition probabilities. Specifically, we capture, at each time step $n$, the estimated or
empirical probability of the target transitioning from state 1 to state 4, or in notation, $q_{1,4}^{(n)}$.

Further, we break the Naive method into two separate versions shown in the plots as
Naive with normalization and Naive without normalization. The first, as its name implies,
normalizes each row of the transition probability matrix at each time step while the second
does not (this breaks the law of total probability but since the algorithm only needs the row
of index values to determine the next sensor allocation the algorithm still functions). The
non-normalized version instead plots the raw ratio of hits to attempted observations, or in
notation it defines the estimated probability as $q_{i,j}^{(n)} = h_{i,j}^{(n)} / v_{i,j}^{(n)}$. The reason for this becomes
apparent in Section 4.1. Namely, the normalized version consistently over-estimates the
target's probabilities since the normalization process we use treats each estimated ratio as
equally precise.

In both Sections 4.1 and 4.2, MATLAB was used to simulate the behaviour of the target
using the standard built-in random number generator, with each method being replicated
100 times. The mean of these 100 simulations was then used to generate the
95% Confidence Interval bounds denoted by the thin colored lines in the following figures.

4.1 Tiny System (Four States)

In this section we examine simulated convergence rates of both the basic Naive and our
proposed Single-Miss MLE methods on a very small and dense example. Specifically, we
examine a four state system corresponding to a potential terrorist's behaviour within a city.
The four states are his house, a cafe, his workplace, and a store. His transition kernel or
transition probability matrix is defined as:

             house (1)   work (2)   store (3)   cafe (4)
house (1)      0.10        0.20       0.30        0.40
work (2)       0.90        0.02       0.04        0.04
store (3)      0.70        0.03       0.15        0.12
cafe (4)       0.82        0.06       0.02        0.10

Since we are looking to learn the behaviour pattern of our target, we measure our performance
on the mode of a row. In this case, we compare how quickly the methods can
estimate the house to cafe transition probability, in notation $p_{1,4} = 0.4$. Of
note, the horizontal axis in Figures 4.1 through 4.4 is discrete time. In Figures 4.1 and 4.3,
the vertical axis is probability with the thick black line marking the true probability, 0.4.
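For readers who want to reproduce the flavour of this experiment, the following Python sketch simulates the four state system with the oracle used only for reset and tracks the hits-only (non-normalized) estimate of $p_{1,4}$. It is not the MATLAB code behind the figures: the UCB-style index with constant 1.5, the normal-approximation confidence interval, the seed, and all function names are assumptions made for illustration.

```python
import math
import random
import statistics

# True transition matrix of the four state example (rows: house, work, store, cafe).
P = [[0.10, 0.20, 0.30, 0.40],
     [0.90, 0.02, 0.04, 0.04],
     [0.70, 0.03, 0.15, 0.12],
     [0.82, 0.06, 0.02, 0.10]]

def run_once(P, n_steps, rng, c=1.5):
    """One replication; returns the path of the hits-only estimate of p_{1,4}."""
    L = len(P)
    hits = [[0] * L for _ in range(L)]
    views = [[0] * L for _ in range(L)]
    state, path = 0, []
    for _ in range(n_steps):
        h, v = hits[state], views[state]
        total = sum(v)
        # Assumed UCB-style index; states never viewed from here are tried first.
        index = [math.inf if v[j] == 0
                 else h[j] / v[j] + math.sqrt(c * math.log(total) / v[j])
                 for j in range(L)]
        sensor = index.index(max(index))
        nxt = rng.choices(range(L), weights=P[state])[0]  # target moves
        v[sensor] += 1
        if sensor == nxt:
            h[sensor] += 1
        state = nxt  # the oracle reveals the new state (reset)
        path.append(hits[0][3] / views[0][3] if views[0][3] else 0.0)
    return path

def mean_with_ci(paths, z=1.96):
    """Pointwise mean and normal-approximation 95% interval over replications."""
    out = []
    for values in zip(*paths):
        m = statistics.mean(values)
        half = z * statistics.stdev(values) / math.sqrt(len(values))
        out.append((m - half, m, m + half))
    return out

rng = random.Random(0)
paths = [run_once(P, 1000, rng) for _ in range(100)]
summary = mean_with_ci(paths)  # list of (lower, mean, upper), one per time step
```

Plotting the three components of `summary` against the time step gives a curve of the same kind as the Naive without normalization line; swapping in a different estimator inside `run_once` would give the corresponding curve for that method.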


[Figure 4.1 plot: "Averaged Estimated p_{1,4} vs time" (N = 1k steps with 100 replications on 4 state system)]

Figure 4.1. Four State Estimated Transition Probabilities vs. Time (mean
with 95% confidence intervals). Generated in MATLAB.

As can be seen in Figure 4.1, our Single-Miss MLE method's 95% Confidence Interval
captures the true probability well before time step n = 50, the Naive without normalization
method captures it around time step n = 500, and the Naive with normalization method
doesn't capture it until well after time step n = 1000. A quick calculation (500 / 50 = 10) shows that our
Single-Miss MLE method learns at least ten times faster than the better of the Naive versions.
Further, we can also see that the normalized Naive version does in fact over-estimate our
parameter of interest. This interesting finding is further examined in Section 4.2.


Additionally, we can see from Figure 4.2 that we also attain logarithmic learning rates
with both MLE methods. At the same time though, the Single-Miss MLE method clearly
outperforms both Naive versions in both the expected cumulative regret as well as the 95%
Confidence Bounds on that mean. For this analysis, we define the expected regret (vertical
axis in Figures 4.2 and 4.4) as the absolute difference between the true probability and the
current estimate $q_{i,j}^{(n)}$, with these differences being summed over all time steps. In notation:

$$\sum_{n=1}^{N} \left| q_{i,j}^{(n)} - p_{i,j} \right|, \quad \text{which for state 1 to 4 is:} \quad \sum_{n=1}^{N} \left| q_{1,4}^{(n)} - 0.4 \right|. \tag{4.1}$$
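Equation 4.1 is a running sum of absolute errors, which can be computed in one pass. The short sketch below assumes the estimates are available as a list indexed by time step; the function name and the sample values are hypothetical.

```python
from itertools import accumulate

def cumulative_regret(estimates, true_p):
    """Running sum of |estimate - true probability| over the time steps."""
    return list(accumulate(abs(q - true_p) for q in estimates))

# Hypothetical estimates of p_{1,4} over four time steps, true value 0.4.
regret = cumulative_regret([0.0, 0.5, 0.4, 0.45], 0.4)
```

Because every term is non-negative, the resulting curve is non-decreasing; it flattens when the estimate settles at the true probability.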


[Figure 4.2 plot: "Average Cumulative Regret ( | true - est | ) p_{1,4} vs time" (N = 1k steps with 100 replications on 4 state system)]

Figure 4.2. Four State Expected Regret vs. Time (mean with 95% confidence
intervals). Generated in MATLAB.

4.2 Small System (Ten States)

In  this  section,  we  expand  the  state  space  from  four  to  ten  states.  This  might  correspond  to 
adding  another  location,  maybe  a  bar,  and  splitting  each  of  these  five  states  into  a  day  and 
night  version. 

Below is a randomly generated transition kernel or transition probability matrix for the
target's behaviour modeled with ten states. The results of applying the algorithms to this
behaviour pattern are shown in Figures 4.3 and 4.4. Of note in the below behaviour matrix,
the $p_{1,4}$ probability is intentionally set to 0.4 for ease of comparison to the four state system
analyzed in Section 4.1.


        1      2      3      4      5      6      7      8      9      10
 1      0      0.16   0      0.40   0.13   0.10   0      0.10   0      0.10
 2      0      0.23   0.16   0      0      0.18   0      0.20   0      0.22
 3      0      0      0.23   0.22   0      0      0.28   0.26   0      0
 4      0      0      0      0      0      0      0.53   0      0      0.46
 5      0.30   0.11   0      0.11   0.13   0.12   0      0      0.13   0.09
 6      0.40   0      0      0      0      0.16   0.16   0      0.10   0.17
 7      0.32   0      0.11   0.12   0.16   0      0.11   0      0.17   0
 8      0.64   0      0      0.12   0      0.11   0      0      0.12   0
 9      0.45   0.13   0      0      0.11   0      0      0.10   0.09   0.11
10      0.61   0      0.12   0      0      0.13   0      0      0.13   0

Since we are looking to learn the behaviour pattern of our target, we again measure our
performance on the mode of the first row. In this case, we compare how quickly the methods
can estimate the state 1 to state 4 transition probability, or in notation $p_{1,4} = 0.4$. The figures
in this section follow the same format as the previous section, namely the horizontal axis
is discrete time. The vertical axis in Figure 4.3 is probability and the vertical axis in
Figure 4.4 is the cumulative regret as defined in Equation 4.1.

[Figure 4.3 plot: "Averaged Estimated p_{1,4} vs time" (N = 2k steps with 100 replications on 10 state system)]

Figure 4.3. Ten State Estimated Transition Probabilities vs. Time (mean
with 95% confidence intervals). Generated in MATLAB.

Again, we see in Figure 4.3 that the estimation power of the Single-Miss method continues
to outperform both Naive versions. In this setting though, the effect of learning from miss
data is less pronounced. Knowing that the target did not transition to a given state, a miss,
provides less information than a miss does in the four state system, since the probability
density is distributed across nine states in this case versus just the three from Section 4.1.

Figure 4.3 highlights even more strongly the fact that the normalized Naive version remains
over-inflated for a very long period of time. This is due to the normalization process failing
to account for the different precision of each ratio. Since the "sub-optimal" arms are looked
at far less frequently (think far smaller sample size), their hits versus attempted views ratios
are by nature less precise (have higher variance) than the more "optimal" arms. The intuition
here is that as a sample size increases we decrease the variance of our estimated parameter,
a specific transition probability in this case. Therefore, if we normalize without some form
of a weighting scheme tied to the different arms' precisions we end up over-inflating the
mode as can be clearly seen in Figure 4.3.

[Figure 4.4 plot: "Average Cumulative Regret ( | true - est | ) p_{1,4} vs time" (N = 2k steps with 100 replications on 10 state system)]

Figure 4.4. Ten State Expected Regret vs. Time (mean with 95% confidence
intervals). Generated in MATLAB.

Figure 4.4 shows that the logarithmic learning rates or cumulative expected regrets are still
achieved as the state space increases. We also see the very pronounced effect of over-inflation
on the cumulative regret for the normalized Naive method. And lastly, we again
see that the Single-Miss MLE method continues to outperform both Naive versions.

CHAPTER 5:

Conclusion and Recommendations


This chapter summarizes the conclusions of this research and explores
some areas of recommended future work.


5.1  Conclusions 

This  thesis  presents  a  novel  method  for  leveraging  and  applying  Machine  Learning  methods 
and  techniques  to  learning  behaviour  patterns  modeled  as  discrete  time  Markov  Chains. 
Further,  the  Single-Miss  MLE  algorithm  obtains  a  logarithmic  learning  rate  and  performs 
significantly  better  in  expectation  than  the  Naive  methods  described  in  Chapters  3  and  4 
in the scenarios examined and presented in Chapter 4. At the same time though, given an
appropriate exploration inflation term, examined and developed in Chapter 3, the Single-Miss
MLE and Naive MLE methods all obtain the desired logarithmic rate.

Chapter 1 laid out the background to this thesis and examined some different areas of
research that intersect in an interesting way, specifically current Stochastic
Multi-Armed Bandit theory and the modeling of a target's behaviour as a discrete time
Markov Chain. The intersection of these two fields of research in our problem and our
resulting Single-Miss MLE algorithm is novel as far as the writers are aware.

5.2  Future  Work 

This section explores some proposed future work or extensions. These ideas are naturally
tied to or flow from assumptions and limitations mentioned in Chapters 1 and 3.

Leveraging  Multi-Step  Misses 

Leveraging the complementary miss data, namely the Single-Miss approach versus the
Naive approach, resulted in a significant gain in convergence rate.
Therefore, extracting additional data from the multi-step miss sequences or strings
should likewise yield additional performance improvements.

Impact of Density and Increased State Space Size

As  Chapter  4  highlights,  the  increased  state  space  size  resulted  in  a  decreased  effect  or 
benefit  from  gleaning  information  from  miss  data.  This  indicates  that  further  analysis 
should  be  conducted  exploring  the  impact  or  effect  of  the  state  space’s  size  and  resultant 
transition  probability  matrix  density  on  the  learning  rate. 

Leveraging  a  Bayesian  Prior 

Depending on how one defines the state space, some transitions may be completely impossible
while others are just highly unlikely. An example of the former is a potential
terrorist transitioning from a cafe at 8 am to the airport at midnight in a single one hour
time step. Others may be highly unlikely based on current technology
or some prior analysis of the target's behaviour by the searcher. These considerations provide
the motivation for a Bayesian approach to this problem, which would
enable this prior knowledge to be captured.


Noisy  Sensor  Responses 

What  if  the  sensor  returns  a  noisy  response,  think  false  negatives  or  positives?  This  might 
correspond  to  a  patrol  officer  walking  by  the  outside  of  a  cafe  during  the  morning  rush.  The 
patrol  officer  glances  through  the  window  but  is  unable  to  see  everyone  in  the  cafe.  The 
officer  then  must  give  some  sort  of  estimate  of  their  certainty  of  the  target’s  absence  from 
the  cafe.  The  converse  could  also  occur  where  a  sensor  thinks  it  observed  the  target  but  in 
fact  did  not.  Implementing  noisy  responses  from  the  sensor  would  decrease  the  abstraction 
of  this  problem  and  hence,  increase  the  applicability  of  the  resultant  algorithm. 

Multiple  Sensors 

Given  a  single  target,  what  if  the  Law  Enforcement  Agency,  the  searcher,  had  multiple 
sensors  to  deploy  or  allocate?  This  concept  could  provide  an  interesting  extension  to  the 
current  Single-Miss  MLE  method  where  the  number  of  sensors  is  still  less  than  the  total 
number of observable states.

Multiple Related Targets

What if, instead of multiple sensors, there were multiple targets whose behaviour was
spatially correlated or otherwise dependent? The algorithm could then leverage data from
every target, potentially increasing the learning rate for all of them.

Time-Horizon  Weighting 

One key assumption of this thesis is that the target’s behaviour pattern remains stationary.
In other words, the target’s behaviour does not change with time. What if we relax this
assumption and allow the target’s behaviour to vary with time? This motivates the development
of a time-horizon weighting scheme that values more recent observations over older data.
Such a scheme could then be extended to form the basis for a change point detection
algorithm.
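
A minimal sketch of one possible weighting scheme, exponential forgetting, is given below. It is offered only as an illustration of the idea: the function name and the forgetting factor `discount` are hypothetical, and the thesis does not prescribe this particular scheme.

```python
def discounted_hit_rate(outcomes, discount):
    """Estimate a hit probability, weighting recent looks more heavily.

    outcomes -- list of 1 (hit) / 0 (miss) results, oldest first
    discount -- hypothetical forgetting factor in (0, 1]; 1 weights all looks equally
    """
    if not 0 < discount <= 1:
        raise ValueError("discount must be in (0, 1]")
    if not outcomes:
        raise ValueError("need at least one outcome")
    n = len(outcomes)
    # The most recent outcome gets weight 1, the one before it `discount`, and so on.
    weights = [discount ** (n - 1 - k) for k in range(n)]
    return sum(w * x for w, x in zip(weights, outcomes)) / sum(weights)


# Made-up history in which the target stops appearing in the observed state.
history = [1, 1, 1, 0, 0]
plain = discounted_hit_rate(history, 1.0)
recent = discounted_hit_rate(history, 0.5)
print(plain, recent)
```

With `discount = 1.0` the estimate is the ordinary sample average; smaller values let the estimate react faster to a change in behaviour at the cost of a noisier estimate when the behaviour is in fact stationary.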

APPENDIX A:

Basic  Mathematical  Notation 


While  not  absolutely  necessary,  this  appendix  provides  a  quick-reference  guide  to  the 
notation  we  use  in  Chapters  3  and  4. 


Sets 

$\ell, i, j \in \mathcal{L} = \{1, 2, \ldots, L\}$. Let $L = |\mathcal{L}|$ (total number of different states). And let $\ell^{-}$ be the
state space complement of $\ell$, or all other states in $\mathcal{L}$.

$n \in \mathbb{N} = \{1, 2, \ldots\}$, being the current discrete time step. We also use $k$ as a past time step.

Data 

$T_n$, the target’s current location or state at time step $n$ ($n$, the current time step).

$P$, true, underlying transition probability matrix governing the target’s transition behavior.

$p_{ij}$, target’s probability of transitioning from state $i$ to $j$, corresponding to the $(i, j)$’th
element within the matrix $P$.


Variables 

$S_n$, sensor’s current location or where it was sent at time $n$ ($n$, the current time step).

$v_{ij}^{(n)}$, cumulative, by $n$, number of sensor allocations to $j$, given the target came from $i$.

$h_{ij}^{(n)}$, cumulative, by $n$, number of target observations in $j$, given the target came from $i$.

$q_{ij}^{(n)}$, target’s estimated, as of $n$, probability of transitioning from state $i$ to $j$, corresponding
to the $(i, j)$’th element within the matrix $Q^{(n)}$ defined next.

$Q^{(n)}$, current, by $n$, empirical transition probability matrix estimating the true behavior.


$R_{ij}^{(n)}$, cumulative regret, by $n$, between the estimated and true transition probabilities for the
transition $i, j$. Specifically, we define this regret as:

$$R_{ij}^{(n)} = \sum_{k=1}^{n} \left| q_{ij}^{(k)} - p_{ij} \right|$$

APPENDIX B:

Simple  Markov  Chain  Example 


While not strictly necessary, this appendix serves as a very quick introduction to discrete
time Markov chains for readers who have not been exposed to these concepts. To that end, we
define a Markov Chain for the non-technical reader and provide a simple example to
facilitate understanding.

Layman’s Definition: A Markov Chain is a set of states, or state space, with associated
transition probabilities that models an object’s stochastic or uncertain behaviour over time.
It assumes that future transitions depend only on the current state of the object and are
therefore independent of the past given that state. We call this independence the memoryless
property of a Markov Chain, or the Markov Property. While this may seem like a rather large
assumption, we can, if needed, embed information about the past into the current state,
thereby maintaining the assumption without losing the mathematical power of the memoryless
property. The following is a more succinct definition using mathematical notation:

Mathematical Definition: A Discrete Time Markov Chain is a stochastic (think uncertain)
process $\{X_t : t = 1, 2, \ldots\}$ taking values in a discrete state space $S = \{1, 2, \ldots, s\}$ that
satisfies the Markov Property. That is, for $A \subset S$:

$$P(X_{t+1} \in A \mid X_1, \ldots, X_t) = P(X_{t+1} \in A \mid X_t)$$


Example:  A  Washing  Machine 

The simplest example is a washing machine. It is either working or broken, up or down, and
the set {up, down} defines its state space. We model the washing machine as a stochastic
process $\{X_t : t = 0, 1, \ldots\}$ with $X_t$ taking on the value of either up or down in each time
period $t$. Generically, if $X_t = x$, then the process or machine is said to be in state $x$ at
time $t$. Further, we define the finite state space as $S = \{1, \ldots, s\}$, which corresponds to
{up, down} in our example. Next, we define the fixed probability of the machine transitioning
from its current state in this time period to another state (or back to the same state) via
$p_{ij}$. In probability notation: $P(X_{t+1} = j \mid X_t = i) = p_{ij}$. Therefore, if the machine is
currently up, we denote the probability of it going down in the next time period as
$p_{\text{up},\text{down}}$ and the probability of it staying up as $p_{\text{up},\text{up}}$. Further, by the Law of Total
Probability, $p_{\text{up},\text{up}} + p_{\text{up},\text{down}} = 1$ or, in layman’s terms, the machine must be either up or
down in the next time period. In general, transition probabilities satisfy the following two
properties:

1. For all $i, j$: $0 \le p_{ij} \le 1$ (Probabilities must be between zero and one)

2. For all $i$: $\sum_{j=1}^{m} p_{ij} = 1$, where $m$ is the number of states (Law of Total Probability)

Further, we can mathematically express the entire transition dynamics of our washing
machine process as a matrix of probabilities. We use the same notation, with $p_{ij}$
corresponding to the $(i, j)$’th element of our two-dimensional transition probability matrix,
$P$. In our example it could look like the following:


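$$
P =
\begin{bmatrix}
p_{\text{up},\text{up}} & p_{\text{up},\text{down}} \\
p_{\text{down},\text{up}} & p_{\text{down},\text{down}}
\end{bmatrix}
=
\begin{bmatrix}
.75 & .25 \\
.80 & .20
\end{bmatrix}
$$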

The above matrix captures the behavior of our simple washing machine example. But as you
are no doubt thinking, not all breakdowns are equal. Some might involve minor damage while
others might be much more extensive. This can also be captured by a Markov Chain, as can be
seen in the modified matrix below. This method of expanding the state space to capture more
detail is very useful and will enable us to apply the algorithms developed within this thesis
to a much broader set of problems than would seem possible at first glance. As can be seen,
we now have two zero probabilities, corresponding to the machine moving directly between
the minor-damage and major-damage states, for example breaking down due to minor damage
and then somehow becoming broken due to major damage (whether these are zero or not
ultimately depends on the system being modeled).


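$$
P_{mod} =
\begin{bmatrix}
p_{\text{up},\text{up}} & p_{\text{up},\text{down}_{min}} & p_{\text{up},\text{down}_{maj}} \\
p_{\text{down}_{min},\text{up}} & p_{\text{down}_{min},\text{down}_{min}} & p_{\text{down}_{min},\text{down}_{maj}} \\
p_{\text{down}_{maj},\text{up}} & p_{\text{down}_{maj},\text{down}_{min}} & p_{\text{down}_{maj},\text{down}_{maj}}
\end{bmatrix}
=
\begin{bmatrix}
.75 & .20 & .05 \\
.80 & .20 & 0.0 \\
.40 & 0.0 & .60
\end{bmatrix}
$$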
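
For readers who prefer code to notation, the sketch below checks the two properties of a transition probability matrix for both washing machine matrices and approximates the long-run fraction of time spent in each state by repeatedly applying the matrix. The function names are our own, and the choice of starting state and number of steps are arbitrary assumptions; the sketch is illustrative only and is not part of the algorithms developed in this thesis.

```python
def is_transition_matrix(matrix, tol=1e-9):
    """Check both properties: entries in [0, 1] and every row summing to one."""
    size = len(matrix)
    for row in matrix:
        if len(row) != size:
            return False
        if any(p < 0 or p > 1 for p in row):
            return False
        if abs(sum(row) - 1.0) > tol:
            return False
    return True


def step(dist, matrix):
    """Advance one time period: new_dist[j] = sum over i of dist[i] * p_ij."""
    size = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(size)) for j in range(size)]


def long_run(matrix, steps=200):
    """Approximate the long-run state distribution by repeated stepping."""
    dist = [1.0] + [0.0] * (len(matrix) - 1)  # start in the first state (up)
    for _ in range(steps):
        dist = step(dist, matrix)
    return dist


# State order: up, down
P = [[0.75, 0.25],
     [0.80, 0.20]]

# State order: up, down (minor), down (major)
P_mod = [[0.75, 0.20, 0.05],
         [0.80, 0.20, 0.00],
         [0.40, 0.00, 0.60]]

print(is_transition_matrix(P), is_transition_matrix(P_mod))
print(long_run(P))
print(long_run(P_mod))
```

For the two-state matrix the machine is up roughly 76 percent of the time in the long run. For the expanded matrix the long-run fractions are 8/11, 2/11, and 1/11 for up, minor breakdown, and major breakdown respectively, which can be verified by solving the balance equations by hand.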